A LANDSCAPE OF THE NEW DARK SILICON DESIGN REGIME
THE RISE OF DARK SILICON IS DRIVING A NEW CLASS OF ARCHITECTURAL TECHNIQUES THAT "SPEND" AREA TO "BUY" ENERGY EFFICIENCY. THIS ARTICLE EXAMINES FOUR RECENTLY PROPOSED DIRECTIONS ("THE FOUR HORSEMEN") FOR ADAPTING TO DARK SILICON, OUTLINES A SET OF EVOLUTIONARY DARK SILICON DESIGN PRINCIPLES, AND SHOWS HOW ONE OF THE DARKEST COMPUTING ARCHITECTURES—THE HUMAN BRAIN—OFFERS INSIGHTS INTO MORE REVOLUTIONARY DIRECTIONS FOR COMPUTER ARCHITECTURE.

Michael B. Taylor, University of California, San Diego

IEEE Micro, September/October 2013. Published by the IEEE Computer Society. 0272-1732/13/$31.00 © 2013 IEEE.

Recent VLSI technology trends have led to a disruptive new regime for digital chip designers, where Moore's law continues but CMOS scaling provides increasingly diminished fruits. As in prior years, the computational capabilities of chips are still increasing by 2.8× per process generation. However, a utilization wall [1] limits us to only 1.4× of this benefit—causing large underclocked swaths of silicon area—hence the term dark silicon [2,3].

A recent paper refers to this widespread disruptive factor informally as the "dark silicon apocalypse" [4], because it officially marks the end of one reality (Dennard scaling [5]), where progress could be measured by improvements in transistor speed and count, and the beginning of a new reality (post-Dennard scaling), where progress is measured by improvements in transistor energy efficiency. Previously, we tweaked our circuits to reduce transistor delays and turbo-charged them with dual-rail domino logic to reduce fan-out-of-4 (FO4) delays. From now on, we will tweak our circuits to minimize the capacitance switched per function; we will strip our circuits down and starve them of voltage to squeeze out every femtojoule. Whereas once we would spend exponentially increasing quantities of transistors to buy performance, now we will spend these transistors to buy energy efficiency.

Fortunately, simple scaling theory makes the utilization wall easy to derive, helping us to think intuitively about the problem. Transistor density continues to improve by 2× every two years, and native transistor speeds improve by 1.4×. But transistor energy efficiency improves by only 1.4×, which, under constant power budgets, causes a 2× shortfall in the energy budget needed to power a chip at its native frequency. Therefore, our utilization of a chip's potential is falling exponentially by a jaw-dropping 2× per generation. Thus, if we are just bumping up against power limitations in the current generation, then in eight years, designs will be 93.75 percent dark!
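The eight-year projection above can be replayed as a back-of-the-envelope sketch (a minimal illustration written for this article's numbers; the two-year cadence and the 2× per-generation shortfall come from the text):

```python
# Utilization-wall projection: each process generation provides 2.8x more
# compute capability (S^3 with S ~ 1.4), but a fixed power budget permits
# only 1.4x, so utilization of the chip's potential drops by S^2 ~ 2x per
# generation (the article rounds S = 1.4 ~ sqrt(2), so S^2 = 2 exactly).
shortfall_per_generation = 2.0   # S^2, with S = sqrt(2)

utilization = 1.0
for _ in range(4):               # ~2 years per generation -> 8 years
    utilization /= shortfall_per_generation

print(f"utilization after 8 years: {utilization}")  # 0.0625
print(f"dark fraction: {1 - utilization:.2%}")      # 93.75%
```

With exact per-generation factors of 1.4 instead of the rounded sqrt(2), the dark fraction comes out slightly lower (about 93.2 percent), which is why the article's round figure is 93.75 percent.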
The CMOS scaling breakdown was the direct cause of industry's transition to multicore in 2005. Because filling chips with cores does not fundamentally circumvent utilization-wall limits, multicore is not the final solution to dark silicon [3]; it is merely industry's initial, transitional response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [1,3]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark—either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. The saved energy can then be applied to increase performance, or to provide longer battery life or lower operating temperatures.

The utilization wall that causes dark silicon

Table 1 shows the derivation of the utilization wall [1] that causes dark silicon [2,3]. It employs a scaling factor, S, which is the ratio between the feature sizes of two processes (for example, S = 32/22 = 1.4× between 32 and 22 nm). In both Dennard and post-Dennard scaling, the transistor count scales by S², and the transistor switching frequency scales by S. Thus, our net increase in computing performance is S³, or 2.8×.

Table 1. Dennard vs. post-Dennard (leakage-limited) scaling [1]. In contrast to Dennard scaling [5], which held until 2005, under the post-Dennard regime, the total chip utilization for a fixed power budget drops by S² with each process generation. The result is an exponential increase in dark silicon for a fixed-sized chip under a fixed power budget.

    Transistor property              Dennard    Post-Dennard
    Δ Quantity                       S²         S²
    Δ Frequency                      S          S
    Δ Capacitance                    1/S        1/S
    Δ V_DD²                          1/S²       1
    ⇒ Δ Power = Δ(QFCV²)             1          S²
    ⇒ Δ Utilization = 1/Δ Power      1          1/S²

However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by S, improving energy efficiency by S. In Dennard scaling, we can scale the threshold voltage and thus the operating voltage, which yields another S² energy-efficiency improvement. However, in today's post-Dennard, leakage-limited regime, we cannot scale the threshold voltage without exponentially increasing leakage, and as a result, we must hold the operating voltage roughly constant. The end result is a shortfall of S², or 2×, per process generation. This shortfall multiplies with each process generation, resulting in exponentially darker silicon over time.

This shortfall prevents multicore from being the solution to scaling [1,3]. Although advancing a single process generation would allow enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× total improvement. Per Figure 1, across two process generations (S = 2), designers could increase core count by 2× while leaving frequency constant, or they could increase frequency by 2× while leaving core count constant, or they could choose some middle ground between the two. The remaining 4× potential remains inaccessible.

More positively stated, the true new potential of Moore's law is a 1.4× energy-efficiency improvement per generation, which could be used to increase performance by 1.4×. Additionally, if we could somehow make use of dark silicon, we could do even better.

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future, and it has proven remarkably accurate (see the sidebar "Is Dark Silicon Real? A Reality Check"). Follow-up work [6-8] has looked at extending this early work [1,3] on dark silicon and multicore scaling with more sophisticated models that incorporate factors such as application space and cache size.

Dark silicon misconceptions

Let's clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it's just silicon that is not used all the time, or at its full frequency.
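The Table 1 bookkeeping above can be checked numerically (a minimal sketch; the per-generation factors are the ones listed in the table):

```python
# Dennard vs. post-Dennard scaling factors for one process generation,
# following Table 1: total chip power scales as Delta(Q * F * C * V^2).
S = 1.4  # feature-size scaling factor between two adjacent processes

def power_factor(quantity, frequency, capacitance, v_squared):
    """Change in total chip power: Delta(Q * F * C * V^2)."""
    return quantity * frequency * capacitance * v_squared

dennard      = power_factor(S**2, S, 1/S, 1/S**2)  # V_DD scales as 1/S
post_dennard = power_factor(S**2, S, 1/S, 1.0)     # V_DD roughly constant

print(f"Dennard power change:          {dennard:.2f}x")       # 1.00x
print(f"post-Dennard power change:     {post_dennard:.2f}x")  # ~1.96x, i.e. S^2
print(f"utilization under fixed power: {1/post_dennard:.2f}") # ~0.51
```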
Sidebar: Is Dark Silicon Real? A Reality Check

A quick survey of recent designs from multicore outfits such as Tilera, Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has six superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74 percent versus the 93.75 percent predicted by the utilization wall. The latest 2012 International Technology Roadmap for Semiconductors also shows that scaling has proceeded consistently with post-Dennard predictions.

[Figure 1 illustrates scaling from 65 nm to 32 nm (S = 2). The trade-off spectrum between core count and frequency spans 4 cores at 2×1.8 GHz, industry's choice of 2×4 cores at 1.8 GHz (8 cores dark, 8 dim), and 4 cores at 1.8 GHz (12 cores dark); designs are 75% dark after two generations and 93% dark after four.]

Figure 1. Multicore scaling leads to large amounts of dark silicon [3]. Across two process generations, there is a spectrum of trade-offs between frequency and core count; these include increasing core count by 2× but leaving frequency constant (top), and increasing frequency by 2× but leaving core count constant (bottom). Any of these trade-off points will have large amounts of dark silicon.
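The sidebar's darkness-ratio arithmetic can be reproduced directly (a minimal sketch using the Prescott and Core i7 4960X figures quoted in the sidebar):

```python
# Reality check from the sidebar: compare the Dennard-predicted throughput
# with shipped silicon for 90-nm -> 22-nm scaling (Prescott -> Core i7 4960X).
S = 90 / 22                      # feature-size ratio, ~4.09

predicted = S**3                 # Dennard: S^2 more cores at S x frequency, ~69x
actual = 6 * (3.6 / 3.8)         # 6 cores at 3.6 GHz vs 1 core at 3.8 GHz, ~5.7x

darkness_ratio = 1 - actual / predicted
print(f"predicted speedup: {predicted:.1f}x")
print(f"actual speedup:    {actual:.1f}x")
print(f"darkness ratio:    {darkness_ratio:.2%}")   # ~91.7%
```

The sidebar's 91.74 percent figure comes from the same calculation with the per-step rounded values (5.7× and 69×).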
Even during the best days of CMOS scaling, microprocessor and other circuits were chock full of "dark logic" used infrequently or for only some applications—for instance, caches are inherently dark because the average cache transistor switches for far less than one percent of cycles, and FPUs remain dark in integer codes.

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits toward swaths of low-duty-cycle logic that exists not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement, because it frees up more of the fixed power budget to be used for even more computation.

The four horsemen

Recently, researchers proposed a taxonomy—the four horsemen—that identifies four promising directions for dealing with dark silicon as we transition beyond the initial multicore stop-gap solution. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering
point of view. But the success of complex multiregime devices such as metal-oxide-semiconductor field-effect transistors (MOSFETs) has shown that engineers can tolerate complexity if the end result is better. Future chips are likely to employ not just one horseman, but all of them, in interesting and unique combinations.

The shrinking horseman

When confronted with the possibility of dark silicon, many chip designers insist that area is expensive, and that they would just build smaller chips instead of having dark silicon in their designs. Among the four horsemen, these "shrinking chips" are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's law provide small benefit. Below is an examination of the spectrum of second-order effects associated with shrinking chips.

Cost side of shrinking silicon. Understanding shrinking chips requires considering semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper; even if silicon begins as 50 percent of system cost, after a few process generations, it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon, which ultimately will eliminate incentives to move the design to the next process generation. These designs will be "left behind" on older generations.

Revenue side of shrinking silicon. Shrinking silicon can also shrink the chip's selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit in the end product, then competition will force companies to do so. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitors will take the high end and enjoy high margins. Thus, in scenarios where dark silicon could be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. Hence, the shrinking-chips scenario is likely to happen only if we can find no practical use for dark silicon.

Power and packaging issues with shrinking chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of many-core thermal characteristics has shown that the peak hotspot temperature rise can be modeled as Tmax = TDP × (Rconv + k/A), where Tmax is the rise in temperature, TDP is the target chip thermal design power, Rconv is the heat sink's thermal convection resistance (lower is a better heat sink), k incorporates many-core design properties, and A is the chip area [8]. If area drops exponentially, the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, and reduce scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold your frequency constant and increase cores by 1.4× with a net area decrease of 1.4× than it is to increase your frequency by 1.4× and shrink your chip by 2×.

The dim horseman

As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that spend area to buy energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we are led to some interesting new design possibilities.

The term dim silicon refers to techniques that put large amounts of otherwise-dark silicon area to productive use by employing heavy underclocking or infrequent use to meet the power budget—that is, the architecture strategically manages the chip-wide transistor duty cycle to enforce the overall power constraint [8,9].
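A concrete sketch of this budgeting trade-off follows; the power figures below are invented for illustration and are not from the article, and the model deliberately holds voltage fixed so that only the duty-cycle bookkeeping shows through:

```python
# Dim-silicon budgeting sketch: given a fixed chip power budget, trade
# core count against clock frequency. Power per core is modeled as
# proportional to frequency only (voltage held fixed for simplicity).
POWER_BUDGET_W = 100.0        # fixed chip-level power budget (hypothetical)
FULL_POWER_PER_CORE_W = 12.5  # one core at full frequency (hypothetical)
TOTAL_CORES = 32              # cores physically on the die

# Option 1 (dark): run a few cores at full frequency, leave the rest dark.
full_speed_cores = int(POWER_BUDGET_W // FULL_POWER_PER_CORE_W)

# Option 2 (dim): run every core, underclocked so the same budget is
# shared across the whole die.
frequency_fraction = POWER_BUDGET_W / (TOTAL_CORES * FULL_POWER_PER_CORE_W)

print(f"dark option: {full_speed_cores} cores at full clock "
      f"({TOTAL_CORES - full_speed_cores} dark)")                # 8 full, 24 dark
print(f"dim option:  {TOTAL_CORES} cores at {frequency_fraction:.0%} clock")  # 25%
```

In practice dynamic power scales with V²f and voltage can be lowered along with frequency, which is exactly what makes the dim option attractive; the linear model above only illustrates how the budget is apportioned.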
Whereas early 90-nm designs such as Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly more elegant methods that make better trade-offs. Dim silicon techniques include dynamically varying the frequency with the number of cores being used, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and computational sprinting [10].

Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant of the number of currently active cores, Intel's Turbo Boost 1.0 enabled second-generation multicores to make real-time trade-offs between active core count and the frequency the cores ran at: the fewer the cores, the higher the frequency. When Turbo Boost is enabled, it uses the energy gained from turning off cores to increase the voltage, and then the frequency, of the active cores. This technique, known as dynamic voltage and frequency scaling (DVFS), increases power proportionally to the cube of the increase in frequency.

NTV processors. In the past, DVFS was also used to save cubic power when frequencies were decreased. However, today, processor manufacturers operate transistors at reduced voltages—around 2.5× the threshold voltage, an energy-delay optimal point. This point is right at the edge of an operating regime where frequency starts to drop precipitously as voltage is reduced, which makes downward DVFS much less effective.

Nonetheless, researchers have begun to explore this regime. One recent approach is near-threshold voltage (NTV) logic [11], which operates transistors in the near-threshold regime slightly above the threshold voltage, providing more palatable trade-offs between energy and delay than subthreshold circuits, for which frequency drops exponentially with voltage decreases. Researchers have explored wide-SIMD NTV processors [12], which seek to exploit data parallelism, along with NTV many-core processors [13] and an NTV x86 processor [14].

Although NTV per-processor performance drops faster than the corresponding savings in energy per instruction (a 5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel if the workload allows it. Then, an additional 5× processors could turn the energy-efficiency gains into additional performance. So, with ideal parallelization, NTV could offer a 5× throughput improvement by absorbing 40× the area. But this would also require 40× more free parallelism in the workload relative to the parallelism consumed by an equivalent energy-limited super-threshold many-core processor. In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications that have this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments, because the energy-limited baseline super-threshold design has consumed less of the available parallelism. Furthermore, NTV clearly becomes more applicable for workloads with extremely large amounts of parallelism.

NTV presents several circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Because NTV designs can expand area consumption by approximately 8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating-voltage static RAMs (SRAMs) and the increased interconnect energy consumption due to greater spreading across cores.
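The NTV throughput arithmetic above can be replayed step by step (a minimal sketch using the article's 5× and 8× figures):

```python
# NTV trade-off from the text: near-threshold operation costs ~8x in
# per-core performance but saves ~5x in energy per instruction.
ENERGY_IMPROVEMENT = 5.0    # energy per instruction: 5x better at NTV
PERF_COST = 8.0             # per-core throughput: 8x worse at NTV

# Power of one NTV core relative to one super-threshold core:
# (energy/instruction) * (instructions/second)
ntv_core_power = (1 / ENERGY_IMPROVEMENT) * (1 / PERF_COST)   # 1/40

# Matching baseline throughput takes 8 NTV cores, at 1/5 the power...
cores_to_match = PERF_COST
power_to_match = cores_to_match * ntv_core_power              # 0.2

# ...so a fixed power budget supports 5x more such clusters:
extra_clusters = 1 / power_to_match
total_cores = cores_to_match * extra_clusters

print(f"throughput gain at equal power: {extra_clusters:.0f}x")   # 5x
print(f"area/parallelism required:      {total_cores:.0f}x")      # 40x
```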
Bigger caches. An often-proposed dim-silicon alternative is to simply allocate otherwise-dark silicon area to caches. Because only a subset of cache transistors (such as a wordline) is accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, a level-1 (L1) cache clocked at its maximum frequency can be about 10× darker per square millimeter, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square millimeter. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4× to 2× more cache per core per generation. However, many applications do not benefit much from additional cache, and upcoming TSV-integrated DRAM will reduce the cache benefit for those applications that do.

Computational sprinting and Turbo Boost. Other techniques employ "temporal dimness" as opposed to "spatial dimness," temporarily exceeding the nominal thermal budget but relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance until the processor reaches its nominal temperature, relying on the heat sink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally, after about 10 seconds), then switches over to four lower-energy, lower-performance A7 cores. Computational sprinting carries this a step further, employing phase-change materials that let chips exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race to finish" computations, such as webpage rendering, for which response latency is important, or for which speeding up the transition of both the processor and its support logic to a low-power state reduces energy consumption.

The specialized horseman

The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy efficient (100× to 1,000×) than a general-purpose processor [1]. Execution hops between coprocessors and general-purpose cores, executing where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy. Unlike dim silicon, which tends to focus on manipulating voltages, frequencies, and duty cycles as ways to manage power, specialized logic focuses on reducing the amount of capacitance that must be switched to perform a particular operation.

The promise of a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals have extrapolated this trend and anticipate that the near future will see systems comprising more coprocessors than general-purpose processors [1,7]. This article refers to these systems as coprocessor-dominated architectures, or CoDAs.

As specialization usage grows to combat the dark silicon problem, we are faced with a modern-day specialization "Tower of Babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication between programmers and software and the underlying hardware. Already, we see the deployment of specialized languages such as CUDA that are not usable across similar architectures (for example, AMD and Nvidia). We see overspecialization problems between accelerators that cause them to become inapplicable to closely related classes of computations (such as double-precision scientific codes running incorrectly on a GPU's non-IEEE-compliant floating-point hardware). Adoption problems are also caused by the excessive costs of programming heterogeneous hardware (such as the slow uptake of the Sony PlayStation 3 versus the Xbox). Moreover, specialized hardware risks obsolescence as standards are revised (for example, a JPEG standard revision).
Insulating humans from complexity. These factors speak to potentially exponential increases in design, verification, and programming effort for these CoDAs. Combating the Tower of Babel problem requires defining a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and programmer from such systems' underlying complexity.

Overcoming Amdahl-imposed limits on specialization. Amdahl's law provides an additional roadblock for specialization. To save energy across the majority of the computation, we must find broad-based specialization approaches that apply to both regular, parallel code and irregular code. We must also ensure that communicating specialized processors don't fritter away their energy savings on costly cross-chip communication or shared-memory accesses.

Recent efforts. The UCSD GreenDroid processor (see Figure 2) [3,15] is one such CoDA-based system that seeks to address both complexity issues and Amdahl limits. GreenDroid is a mobile application processor that implements Android mobile environment hotspots using hundreds of specialized cores called conservation cores, or c-cores [1,9]. C-cores, which target both irregular and regular code, are automatically generated from C or C++ source code, and support a patching mechanism that lets them track software changes. They attain an estimated 8× to 10× energy-efficiency improvement, at no loss in serial performance, even on nonparallel code, and without any user or programmer intervention. Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads in which execution time is loosely concentrated, NTV processors might hold an area advantage because of their reconfigurability.

Other specialized processors, such as the University of Wisconsin-Madison's DySER [16] and the University of Michigan's BERET [17], propose alternative architectures that exploit specialization like c-cores do, but focus on improving reconfigurability at the cost of energy savings. Recent efforts have also examined the use of approximate neural-network-based computing as an elegant way to package programmability, reconfigurability, and specialization [18].

The "deus ex machina" horseman

Of the four horsemen, this is by far the most unpredictable. "Deus ex machina" refers to a plot device in literature or theater in which the protagonists seem increasingly doomed until the very last moment, when something completely unexpected comes out of nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in semiconductor devices. However, as we shall see, the breakthroughs required would have to be quite fundamental—in fact, we most likely would have to build transistors out of devices other than MOSFETs. Why? Because MOSFET leakage is set by fundamental principles of device physics, and is limited to a subthreshold slope of 60 mV/decade at room temperature; this corresponds to a 10× reduction in leakage current for every 60 mV that the threshold voltage is above Vss, and is determined by the properties of thermionic emission of carriers across a potential well. Thus, although innovations such as Intel's FinFET/TriGate transistor and high-K dielectrics represent significant achievements in maintaining a subthreshold slope close to its historical values, they still remain within the scope of MOSFET-imposed limits and are one-time improvements rather than scalable changes.

Two VLSI candidates that bypass these limits, because they are not based on thermal injection, are tunnel field-effect transistors (TFETs) [19], which are based on tunneling effects, and nanoelectromechanical system (NEMS) switches [20], which are based on physical relays. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade—twice as good as the ideal MOSFET—but with lower on-currents than MOSFETs, limiting their use in high-performance circuits.
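The 60 mV/decade relationship quoted above can be expressed directly (a minimal sketch; the slope values come from the text, and the 180 mV margin is an arbitrary example):

```python
# Subthreshold leakage model: leakage current falls by 10x for every
# `slope` millivolts of threshold-voltage margin. 60 mV/decade is the
# room-temperature MOSFET limit set by thermionic emission.
def leakage_reduction(margin_mv, slope_mv_per_decade):
    """Factor by which leakage shrinks for a given voltage margin."""
    return 10 ** (margin_mv / slope_mv_per_decade)

# Ideal MOSFET vs. a TFET at ~30 mV/decade, for a 180 mV margin:
print(leakage_reduction(180, 60))   # MOSFET: 10^3 = 1000x
print(leakage_reduction(180, 30))   # TFET:   10^6 = 1000000x
```

The factor-of-two slope improvement thus squares the leakage reduction available from a given voltage margin, which is why sub-60 mV/decade devices are so attractive.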
[Figure 2 shows the GreenDroid floorplan: (a) a 4×4 array of nonidentical tiles, each pairing an L1 cache with a CPU; (b) a 1 mm × 1 mm tile containing the CPU, on-chip network (OCN), I-cache, D-cache, FPU, an internal state interface, and several c-cores of various sizes; (c) the in-tile interconnect among these components.]

Figure 2. The GreenDroid architecture, an example of a coprocessor-dominated architecture (CoDA). The GreenDroid mobile application processor comprises 16 nonidentical tiles (a). Each tile (b) holds components common to every tile—the CPU, on-chip network (OCN), and shared level-1 (L1) data cache—and provides space for multiple conservation cores, or c-cores, of various sizes. A variety of in-tile networks (c) connect components and c-cores.

NEMS devices have essentially a near-zero subthreshold slope but slow switching times. Both TFETs and NEMS devices thus hint at orders-of-magnitude improvements in leakage, but remain untamed and fall short of being integrated into real chips.

Realizing the importance of the fourth horseman, the recent $194 million DARPA/MARCO STARnet program is funding four centers, each focusing on a key direction for beyond-CMOS approaches: developing electron spin-based memory and computation devices (C-SPIN), formulating new information-processing models that can leverage statistical (that is, nondeterministic)
beyond-CMOS devices (SONIC), engineering nonconventional atomic-scale engineered materials (FAME), and creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).

Evolutionary design principles for dark silicon
While researchers work to mature the new ideas represented by the four horsemen, what principles should guide today's designs that must tackle dark silicon? Listed below is a set of evolutionary, rather than revolutionary, dark silicon design principles that are motivated by changing trade-offs created by dark silicon:

- Moving to the next generation will provide an automatic 1.4× energy-efficiency increase. Figure out how you will use it. As a baseline, chip capabilities will scale with energy, whether it is allocated to frequency or more cores. You can increase or decrease frequency or transistor counts, but transistors switched per unit time can increase by only 1.4×.
- The next generation will create a large amount of dark area. Determine, for your domain, how to trade mostly dark area for energy. If the die area is fixed, any scaling is going to have a surplus of transistors. Which combination of the four horsemen is most effective in your domain? Should you go dim—more caches? Underclocked arrays of cores? NTV on top of that? Add accelerators or c-cores? Use new kinds of devices? Shrink your chip?
- Pipelining makes less sense than it used to. Figure out if faster transistor delays will allow you to fit more in a pipeline stage without reducing frequency. Pipelining increases duty cycle and introduces additional capacitance in circuits (registers, prediction circuits, bypassing, and clock-tree fanout), neither of which is dark silicon friendly. Reducing pipeline depth and increasing FO4 depths reduce capacitive overhead. Note, too, that excessive pipelining and frequency exacerbate the gap between processing and memory.
- Architectural multiplexing and logic sharing are becoming increasingly questionable optimizations. See if they still make sense. Sharing introduces additional energy consumption because it requires sharers to have longer wires to the shared logic, and it introduces additional performance and energy overheads from the control logic that manages the sharing. For example, architectures that have repositories of nonshared state that share physical pipelines (such as large-scale multithreading) pay large wire capacitances inside these memories to share that state. As area gets cheaper, it will make less sense to pay these overheads, and the degree of sharing will decrease so that the energy cost of pulling state out of these state repositories will be reduced.
- Multiplexing and RAMs that facilitate sharing of program data are still a good idea. Keep them. If different threads of control are truly sharing data, multiplexed structures, such as shared RAM or crossbars, are often still more efficient than coherence protocols or other schemes.
- Architectural techniques for saving transistors should only be applied if they do not worsen energy efficiency. Transistors are getting exponentially cheaper, and we can't use them all at once. Why are we trying to save transistors? Locally, transistor-saving optimizations make sense, but an exponential wind is blowing against these optimizations in the long run.
- Power rails are the new clocks. Design with them in mind. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, all with their own clock gates. With dark silicon, we will see the same effect with power rails; we will have hundreds and maybe thousands of power rails in the future, all with their own power gates, to manage the leakage for the many heterogeneous system components.
- Heterogeneity results from the shift from a 1D objective function (performance) to a 2D objective function (performance and energy). Design with the shape of this function in mind. The past lacked in
heterogeneity, because designs were largely measured according to a single axis—performance. To first order, there was a single optimal design point. Now that performance and energy are both important, a Pareto curve trades off performance and energy, and there is no one optimal design across that curve; there are many optimal points. Optimal designs will incorporate several such points across these curves.

These rules of thumb will guide our existing designs along an evolutionary path to become increasingly dark silicon friendly—but what then of more revolutionary approaches?

Insights from the brain: a dark technology
Perhaps one promising indicator that low-duty-cycle "dark technology" can be mastered, unlocking new application domains, is the efficiency and density of the human brain. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, the brain embodies an existence proof of highly parallel, reliable, and dark operation, and embodies three of the horsemen—dim, specialized, and deus ex machina. Neurons operate with extremely low duty cycles compared to processors—at best, 1 kilohertz. Although computing with silicon-simulated neurons introduces excessive "interpretive" overheads—neurons and transistors have fundamentally different properties—the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low duty cycles and low voltages called for by dark silicon. Here are some of these properties, which may give us insight on more revolutionary extensions to the evolutionary principles proposed in the last section:

- Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and allow reconfiguration, evolving over time synaptic connections customized to the computation.
- Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to arithmetic logic unit (ALU) transistors that toggle three billion times per second. The most active neuron's activity is a millionth of that of processing transistors in today's processors.
- Low-voltage operation. Brain cells operate at approximately 100 mV, yielding CV² energy savings of 100× versus 1-V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low swing and low voltage, saving large amounts of energy.
- Limited sharing and memory multiplexing. Any given neuron can switch only 1,000 times per second, by definition, so it must have extremely limited sharing, because a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6M cones in the retina, similar to a 2-megapixel display, processes it with local neurons, and then sends it on the 1M-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has a set of its own ALUs, so to speak, so energy waste due to multiplexing is minimal.
- Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If using 2 megapixels suffices to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome.
- Analog operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; and on the output side, despite producing rail-to-rail digital pulses, they encode multiple bits of information via spike timings. Could this suggest that there are more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed
digitally. However, analog techniques might not scale well to deep nanometer technology.
- Fast, static, "gather, reduce, and broadcast" operators. Neurons have fan-out and fan-in of approximately 7,000 to other neurons that are located significant distances away. Effectively, they can perform efficient operations that combine vector-style gather memory accesses to large numbers of static-memory locations, with a vector-style reduction operator and a broadcast. Do more efficient ways exist for implementing these operations in silicon? It could be useful for computations that operate on finite-sized static graphs.

Recently, both the EU and US governments have proposed initiatives to enable greater studies of the computational capabilities of the brain. Although brain-inspired computing has already come and gone several times in the brief history of manmade computers, dark silicon may cause these approaches to become increasingly relevant.

Although silicon is getting darker, for researchers the future is bright and exciting. Dark silicon will cause a transformation of the computational stack and provide many opportunities for investigation. MICRO

Acknowledgments
This work was partially supported by NSF awards 0846152, 1018850, 0811794, and 1228992, Nokia and AMD gifts, and by STARnet, an SRC program sponsored by MARCO and DARPA. I thank the anonymous reviewers for their valuable insights and suggestions.

References
1. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Architectural Support for Programming Languages and Operating Systems Conf., ACM, 2010, pp. 205-218.
2. R. Merritt, "ARM CTO: Power Surge Could Create 'Dark Silicon,'" EE Times, 22 Oct. 2009.
3. N. Goulding et al., "GreenDroid: A Mobile Application Processor for a Future of Dark Silicon," Hot Chips Symp., 2010.
4. M. Taylor, "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," Proc. 49th Ann. Design Automation Conf. (DAC 12), ACM, 2012, pp. 1131-1136.
5. R.H. Dennard, "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. 256-268.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ACM SIGARCH Computer Architecture News, vol. 39, no. 3, 2011, pp. 365-376.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. W. Huang et al., "Scaling with Design Constraints: Predicting the Future of Big Chips," IEEE Micro, vol. 31, no. 4, 2011, pp. 16-29.
9. J. Sampson et al., "Efficient Complex Operators for Irregular Codes," Proc. 17th Int'l Symp. High Performance Computer Architecture (HPCA 11), IEEE CS, 2011, pp. 491-502.
10. A. Raghavan et al., "Computational Sprinting," Proc. IEEE 18th Int'l Symp. High-Performance Computer Architecture (HPCA 12), IEEE CS, 2012, doi:10.1109/HPCA.2012.6169031.
11. R. Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," Proc. IEEE, vol. 98, no. 2, 2010, pp. 253-266.
12. E. Krimer et al., "Synctium: A Near-Threshold Stream Processor for Energy-Constrained Parallel Applications," IEEE Computer Architecture Letters, Jan. 2010, pp. 21-24.
13. D. Fick et al., "Centip3de: A 3930 DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 190-192.
14. S. Jain et al., "A 280 mV-to-1.2 V Wide-Operating-Range IA-32 Processor in 32 nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 66-68.
15. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86-95.
16. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. IEEE 17th Int'l Symp. High-Performance Computer Architecture (HPCA 11), IEEE CS, 2011, doi:10.1109/HPCA.2011.5749755.
17. S. Gupta et al., "Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM, 2011, pp. 12-23.
18. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS, 2012, pp. 449-460.
19. A. Ionescu et al., "Tunnel Field-Effect Transistors as Energy-Efficient Electronic Switches," Nature, 17 Nov. 2011, pp. 329-337.
20. F. Chen et al., "Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2010, pp. 150-151.

Michael B. Taylor is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego, where he leads the Center for Dark Silicon. His research interests include dark silicon, chip design, parallelization tools, and Bitcoin computing systems. Taylor has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Michael B. Taylor, 9500 Gilman Drive, MC 0404 EBU 3B 3202, La Jolla, CA 92093-0404; mbtaylor@ucsd.edu.

SEPTEMBER/OCTOBER 2013