An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS
Sriram R. Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Arvind Singh, Tiju Jacob, Shailendra Jain, Vasantha Erraguntla, Clark Roberts, Yatin Hoskote, Nitin Borkar, and Shekhar Borkar

Abstract—This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8 × 10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.

Index Terms—CMOS digital integrated circuits, crossbar router and network-on-chip (NoC), floating-point unit, interconnection, leakage reduction, MAC, multiply-accumulate.

Fig. 1. NoC architecture.

I. INTRODUCTION

The scaling of MOS transistors into the nanometer regime opens the possibility of creating large scalable Network-on-Chip (NoC) architectures [1] containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. The basic concept is to replace today's shared buses with on-chip packet-switched interconnection networks [2]. NoC architectures use layered protocols and packet-switched networks which consist of on-chip routers, links, and well-defined network interfaces. As shown in Fig. 1, the basic building block of the NoC architecture is the "network tile". The tiles are connected to an on-chip network that routes packets between them. Each tile may consist of one or more compute cores and includes logic responsible for routing and forwarding packets based on the routing policy of the network. The structured network wiring of such a NoC design gives well-controlled electrical parameters that simplify timing and allow the use of high-performance circuits to reduce latency and increase bandwidth. Recent tile-based chip multiprocessors include the RAW [3], TRIPS [4], and ASAP [5] projects. These tiled architectures show promise for greater integration, high performance, good scalability, and potentially high energy efficiency.

With the increasing demand for interconnect bandwidth, on-chip networks are taking up a substantial portion of the system power budget. The 16-tile MIT RAW on-chip network consumes 36% of total chip power, with each router dissipating 40% of individual tile power [6]. The routers and links of the Alpha 21364 microprocessor consume about 20% of the total chip power. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. At the same time, while applications dictate the choice of the compute core, the advent of multimedia applications such as three-dimensional (3-D) graphics and signal processing places stronger demands on self-contained, low-latency floating-point processors with increased throughput. A computational fabric built using these optimized building blocks is expected to provide high levels of performance in an energy-efficient manner. This paper describes the design details of an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of average performance while dissipating less than 100 W.

The remainder of the paper is organized as follows. Section II gives an architectural overview of the 80-tile NoC and describes the key building blocks; it also explains the FPMAC unit pipeline and the design optimizations used to accomplish single-cycle accumulation, along with the router architecture, NoC communication protocol, and packet formats. Section III describes chip implementation details, including the high-speed mesochronous clock distribution network used in this design and the circuits used for leakage power management in both logic and memory blocks. Section IV presents the chip measurement results. Section V concludes by summarizing the NoC architecture along with key performance and power numbers.

Manuscript received April 16, 2007; revised September 27, 2007. S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar are with the Microprocessor Technology Laboratories, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: sriram.r.vangal@intel.com). A. Singh, T. Jacob, S. Jain, and V. Erraguntla are with the Microprocessor Technology Laboratories, Intel Corporation, Bangalore 560037, India. Digital Object Identifier 10.1109/JSSC.2007.910957
Fig. 2. NoC block diagram and tile architecture.

II. NOC ARCHITECTURE

The NoC architecture (Fig. 2) contains 80 tiles arranged as an 8 × 10 2-D mesh network that is designed to operate at 4 GHz [7]. Each tile consists of a processing engine (PE) connected to a 5-port router with mesochronous interfaces (MSINT), which forwards packets between the tiles. The 80-tile on-chip network enables a bisection bandwidth of 2 Terabits/s. The PE contains two independent fully-pipelined single-precision floating-point multiply-accumulator (FPMAC) units, a 3 KB single-cycle instruction memory (IMEM), and 2 KB of data memory (DMEM). A 96-bit Very Long Instruction Word (VLIW) encodes up to eight operations per cycle. With a 10-port (6-read, 4-write) register file, the architecture allows scheduling to both FPMACs, simultaneous DMEM loads and stores, packet send/receive from the mesh network, program control, and dynamic sleep instructions. A router interface block (RIB) handles packet encapsulation between the PE and router. The fully symmetric architecture allows any PE to send (receive) instruction and data packets to (from) any other tile. The 15 fan-out-of-4 (FO4) design uses a balanced core and router pipeline, with critical stages employing performance-setting semi-dynamic flip-flops. In addition, a scalable low-power mesochronous clock distribution is employed in a 65-nm eight-metal CMOS process, enabling high integration and single-chip realization of the teraFLOPS processor.

A. FPMAC Architecture

The nine-stage pipelined FPMAC architecture (Fig. 3) uses a single-cycle accumulate algorithm [8] with base-32 and internal carry-save arithmetic with delayed addition. The FPMAC contains a fully pipelined multiplier unit and a single-cycle accumulation loop, followed by pipelined addition and normalization units. Operands A and B are 32-bit inputs in IEEE-754 single-precision format [9]. The design is capable of sustained pipelined performance of one FPMAC instruction every 250 ps. The multiplier is designed using a Wallace tree of 4-2 carry-save adders. The well-matched delays of each Wallace tree stage allow for highly efficient pipelining. Four Wallace tree stages are used to compress the partial product bits to a sum and carry pair. Notice that the multiplier does not use a carry-propagate adder at the final stage. Instead, the multiplier retains the output in carry-save format and converts the result to base 32 prior to accumulation. In an effort to achieve fast single-cycle accumulation, we first analyzed each of the critical operations involved in conventional FPUs with the intent of eliminating, reducing, or deferring the logic operations inside the accumulate loop, and identified the following three optimizations [8].

1) The accumulator retains the multiplier output in carry-save format and uses an array of 4-2 carry-save adders to "accumulate" the result in an intermediate format. This removes the need for a carry-propagate adder in the critical path.

2) Accumulation is performed in a base-32 system, converting the expensive variable shifters in the accumulate loop to constant shifters.

3) The costly normalization step is moved outside the accumulate loop, where the accumulation result in carry-save format is added, and the sum is normalized and converted back to base 2.

These optimizations allow accumulation to be implemented in just 15 FO4 stages. This approach also reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle.
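To make the loop structure concrete, the short Python sketch below models the three optimizations behaviorally: the running total is kept as an unresolved (sum, carry) pair, alignment shifts are multiples of 5 bits because exponents are tracked in base 32, and the carry-propagate add plus normalization happen only after the loop. This is an editorial illustration rather than the FPMAC datapath: it uses a 3-2 carry-save step where the hardware uses 4-2 compressors, plain integers for mantissas, and a precomputed common exponent.

```python
# Behavioral sketch (not the chip RTL) of the single-cycle accumulate idea:
# (1) carry-save accumulation with no carry-propagate adder in the loop,
# (2) base-32 exponent handling so alignment shifts come in 5-bit steps,
# (3) deferred carry-propagate add and normalization after the loop.

DIGIT = 5  # base 32 = 2**5

def csa(a, b, c):
    # One carry-save compression step: three terms in, (sum, carry) out,
    # with no carry propagation across bit positions.
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def accumulate(products):
    """products: list of (mantissa, exp32) pairs, exp32 in base-32 units."""
    max_e = max(e for _, e in products)            # common base-32 exponent
    acc_s, acc_c = 0, 0
    for m, e in products:
        aligned = m >> ((max_e - e) * DIGIT)       # shift granularity is 5 bits
        acc_s, acc_c = csa(acc_s, acc_c, aligned)  # one "cycle" of the loop
    # Deferred work, outside the loop: carry-propagate add, then convert back
    # to an ordinary base-2 value (normalization stand-in).
    return (acc_s + acc_c) * 2.0 ** (max_e * DIGIT)

print(accumulate([(3, 0)] * 4))  # 3 accumulated four times -> 12.0
```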
Careful pipeline re-balancing allows removal of three pipe stages, resulting in a 25% latency improvement over the work in [8]. The dual FPMACs in each PE provide 16 GFLOPS of aggregate performance and are critical to achieving the goal of teraFLOPS performance.
Fig. 3. FPMAC nine-stage pipeline with single-cycle accumulate loop.

B. Instruction Set

The architecture defines a 96-bit VLIW which allows a maximum of eight operations to be issued every cycle. The instructions fall into one of six categories (Table I): instruction issue to both floating-point units, simultaneous data memory loads and stores, packet send/receive via the on-die mesh network, program control using jump and branch instructions, synchronization primitives for data transfer between PEs, and dynamic sleep instructions. The data path between the DMEM and the register file supports transfer of two 32-bit data words per cycle on each load (or store) instruction. The register file issues four 32-bit data words to the dual FPMACs per cycle, while retiring two 32-bit results every cycle. The synchronization instructions aid with data transfer between tiles and allow the PE to stall while waiting for data (WFD) to arrive. To aid with power management, the architecture provides special instructions for dynamic sleep and wakeup of each PE, including independent sleep control of each floating-point unit inside the PE. The architecture allows any PE to issue sleep packets to any other tile or wake it up for processing tasks. With the exception of FPU instructions, which have a pipelined latency of nine cycles, most instructions execute in 1–2 cycles.

TABLE I. INSTRUCTION TYPES AND LATENCY

C. NoC Packet Format

Fig. 4 describes the NoC packet structure and routing protocol. The on-chip 2-D mesh topology utilizes a 5-port router based on wormhole switching, where each packet is subdivided into "FLITs" or "Flow control unITs". Each FLIT contains six control signals and 32 data bits. The packet header (FLIT_0) allows for a flexible source-directed routing scheme, where a 3-bit destination ID field (DID) specifies the router exit port; this field is updated at each hop. Flow control and buffer management between routers is debit-based using almost-full bits, which the receiver queue signals via two flow control bits when its buffers reach a specified threshold. Each header FLIT supports a maximum of 10 hops. A chained header (CH) bit in the packet provides support for a larger number of hops. Processing engine control information, including sleep and wakeup control bits, is specified in FLIT_1, which follows the header FLIT. The minimum packet size required by the protocol is two FLITs. The router architecture places no restriction on the maximum packet size.

Fig. 4. NoC protocol: packet format and FLIT description.
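To make the header layout concrete, the Python sketch below packs a source-directed route into FLIT_0 and peels off one 3-bit DID per hop. Only the field widths stated above (32 data bits plus 6 control bits per FLIT, a 3-bit DID, up to 10 hops, a CH bit, and PE control in FLIT_1) come from the text; the exact bit positions, the control-bit encoding, and the shift-based DID update are assumptions made for illustration.

```python
# Illustrative FLIT/header packing -- a sketch, not the chip's exact bit layout.

from dataclasses import dataclass
from typing import List

@dataclass
class Flit:
    control: int  # 6 control bits (lane ID, valid, header/tail framing, ...)
    data: int     # 32 data bits

def make_header(route: List[int], chained: bool = False) -> Flit:
    """Pack a source-directed route (one 3-bit exit port per hop, max 10 hops)."""
    assert len(route) <= 10
    data = 0
    for hop, port in enumerate(route):
        assert 0 <= port < 8                  # 3-bit destination ID (DID) per hop
        data |= port << (3 * hop)
    data |= int(chained) << 30                # assumed position of the CH bit
    return Flit(control=0b000001, data=data)  # assumed "header" control encoding

def next_hop(header: Flit) -> int:
    """Each router reads its 3-bit exit port, then updates (shifts) the DID field."""
    port = header.data & 0b111
    header.data >>= 3
    return port

# Minimum legal packet: header FLIT_0 plus one FLIT_1 carrying PE control/data.
packet = [make_header([2, 2, 1]), Flit(control=0b000010, data=0x0000_00FF)]
print(next_hop(packet[0]))  # first router exits on port 2
```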
D. Router Architecture

A 4 GHz five-port two-lane pipelined packet-switched router core (Fig. 5) with phase-tolerant mesochronous links forms the key communication fabric for the 80-tile NoC architecture. Each port has two 39-bit unidirectional point-to-point links. The input-buffered wormhole-switched router [18] uses two logical lanes (lanes 0–1) for deadlock-free routing and a fully non-blocking crossbar switch with a total bandwidth of 80 GB/s (32 data bits × 4 GHz × 5 ports). Each lane has a 16-FLIT queue, an arbiter, and flow control logic. The router uses a 5-stage pipeline with a two-stage round-robin arbitration scheme that first binds an input port to an output port in each lane and then selects a pending FLIT from one of the two lanes. A shared data path architecture allows crossbar switch re-use across both lanes on a per-FLIT basis. The router links implement a mesochronous interface with first-in-first-out (FIFO) based synchronization at the receiver.

Fig. 5. Five-port two-lane shared crossbar router architecture.

The router core features a double-pumped crossbar switch [10] to mitigate crossbar interconnect routing area. The schematic in Fig. 6(a) shows the 36-bit crossbar data bus double-pumped at the fourth pipe-stage of the router by interleaving alternate data bits using dual edge-triggered flip-flops, reducing crossbar area by 50%. In addition, the proposed router architecture shares the crossbar switch across both lanes on an individual FLIT basis. The combined application of both ideas enables a compact 0.34 mm² design, resulting in a 34% reduction in router layout area as shown in Fig. 6(b), 26% fewer devices, a 13% improvement in average power, and a one-cycle latency reduction (from 6 to 5 cycles) over the router design in [11] when ported and compared in the same 65-nm process [12]. Results from the comparison are summarized in Table II.

Fig. 6. (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11].

TABLE II. ROUTER COMPARISON OVER WORK IN [11]
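The two-stage arbitration described above can be sketched behaviorally as follows. The Python model is a simplification: it re-arbitrates every cycle, whereas a wormhole router holds an input-to-output binding for the duration of a packet, and the class and signal names are invented for illustration. Only the structure, a per-lane round-robin binding of input ports to output ports followed by a per-output lane select, follows the text.

```python
# Hypothetical model of two-stage round-robin arbitration for a 5-port, 2-lane router.

PORTS, LANES = 5, 2

class RoundRobin:
    """Rotating-priority pick among requesters."""
    def __init__(self, n):
        self.n, self.last = n, n - 1
    def pick(self, requests):
        for i in range(1, self.n + 1):
            cand = (self.last + i) % self.n
            if requests[cand]:
                self.last = cand
                return cand
        return None

class TwoStageArbiter:
    def __init__(self):
        # one port arbiter per (lane, output port), one lane arbiter per output port
        self.port_arb = [[RoundRobin(PORTS) for _ in range(PORTS)] for _ in range(LANES)]
        self.lane_arb = [RoundRobin(LANES) for _ in range(PORTS)]

    def arbitrate(self, requests):
        """requests[lane][in_port] = desired output port (or None).
        Returns grants[out_port] = (lane, in_port) for this cycle."""
        grants = {}
        for out in range(PORTS):
            # Stage 1: per lane, choose one input port requesting this output.
            lane_winner = [None] * LANES
            for lane in range(LANES):
                wants = [requests[lane][p] == out for p in range(PORTS)]
                win = self.port_arb[lane][out].pick(wants)
                if win is not None:
                    lane_winner[lane] = win
            # Stage 2: choose between the two lanes for this output port.
            lane = self.lane_arb[out].pick([w is not None for w in lane_winner])
            if lane is not None:
                grants[out] = (lane, lane_winner[lane])
        return grants

# Both lanes' input port 0 want output port 3; grants alternate lanes over cycles.
arb = TwoStageArbiter()
reqs = [[3, None, None, None, None], [3, None, None, None, None]]
print(arb.arbitrate(reqs))  # {3: (0, 0)}
print(arb.arbitrate(reqs))  # {3: (1, 0)}
```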
E. Mesochronous Communication

The 2-mm-long point-to-point unidirectional router links implement a phase-tolerant mesochronous interface (Fig. 7). Four of the five router links are source synchronous, each providing a strobe (Tx_clk) with 38 bits of data. To reduce active power, Tx_clk is driven at half the clock rate. A 4-deep circular FIFO built using transparent latches captures data on both edges of the delayed link strobe at the receiver. The strobe delay and duty-cycle can be digitally programmed using the on-chip scan chain. A synchronizer circuit sets the latency between the FIFO write and read pointers to 1 or 2 cycles at each port, depending on the phase of the arriving strobe with respect to the local clock. A more aggressive low-latency setting reduces the synchronization penalty by one cycle. The interface includes the first stage of the router pipeline.

Fig. 7. Phase-tolerant mesochronous interface and timing diagram.

F. Router Interface Block (RIB)

The RIB is responsible for message-passing and aids with synchronization of data transfer between the tiles and with power management at the PE level. Incoming 38-bit-wide FLITs are buffered in a 16-entry queue, where demultiplexing based on the lane ID and framing to 64 bits for data packets (DMEM) and 96 bits for instruction packets (IMEM) are accomplished. The buffering is required during program execution, since DMEM stores from the 10-port register file have priority over data packets received by the RIB. The unit decodes FLIT_1 (Fig. 4) of an incoming instruction packet and generates several PE control signals. This allows the PE to start execution (REN) at a specified IMEM address (PCA) and is enabled by the new program counter (NPC) bit. After receiving a full data packet, the RIB generates a break signal to continue execution if the IMEM is in a stalled (WFD) mode. Upon receipt of a sleep packet via the mesh network, the RIB unit can also dynamically put the entire PE to sleep or wake it up for processing tasks on demand.

III. DESIGN DETAILS

To allow 4 GHz operation, the entire core is designed using hand-optimized data path macros. CMOS static gates are used to implement most of the logic. However, critical registers in the FPMAC and router logic utilize implicit-pulsed semi-dynamic flip-flops (SDFF) [13], [14]. The SDFF (Fig. 8) has a dynamic master stage coupled to a pseudo-static slave stage. The FPMAC accumulator register is built using data-inverting rising edge-triggered SDFFs with synchronous reset and enable. The negative setup time of the flip-flop is taken advantage of in the critical path. When compared to a conventional static master–slave flip-flop, the SDFF provides both shorter latency and the capability of incorporating logic functions with minimum delay penalty, properties which are desirable in high-performance digital designs.

Fig. 8. Semi-dynamic flip-flop (SDFF) schematic.

The chip uses a scalable global mesochronous clocking technique, which allows clock phase-insensitive communication across tiles and synchronous operation within each tile. The on-chip phase-locked loop (PLL) output [Fig. 9(a)] is routed using horizontal M8 and vertical M7 spines. Each spine consists of differential clocks for low duty-cycle variation along the worst-case clock route of 26 mm. An opamp at each tile converts the differential clocks to a single-ended clock with a 50% duty cycle prior to distributing the clock across the tile using a balanced H-tree. This clock distribution scales well as tiles are added or removed. The worst-case simulated global duty-cycle variation is 3 ps, and the local clock skew within the tile is 4 ps. Fig. 9(b) shows simulated clock arrival times for all 80 tiles at 4 GHz operation. Note that multiple cycles are required for the global clock to propagate to all 80 tiles. The systematic clock skews inherent in the distribution help spread peak currents due to simultaneous clock switching over the entire cycle.

Fig. 9. (a) Global mesochronous clocking and (b) simulated clock arrival times.

Fine-grained clock gating [Fig. 9(a)], sleep transistors, and body bias circuits [15] are used to reduce active and standby leakage power; they are controlled at full-chip, tile-slice, and individual tile levels based on workload. Each tile is partitioned into 21 smaller sleep regions with dynamic control of individual blocks in the PE and router units based on instruction type. The router is partitioned into 10 smaller sleep regions with control of individual router ports, depending on network traffic patterns. The design uses nMOS sleep transistors to reduce frequency penalty and area overhead. Fig. 10 shows the router and on-die network power management scheme. The enable signals gate the clock to each port, the MSINT, and the links. In addition, the enable signals also activate the nMOS sleep transistors in the input queue arrays of both lanes. The 360 μm nMOS sleep device in the register file is sized to provide a 4.3X reduction in array leakage power with a 4% frequency impact. The global clock buffer feeding the router is finally gated at the tile level based on port activity.

Fig. 10. Router and on-die network power management.

Each FPMAC implements unregulated sleep transistors with no data retention [Fig. 11(a)]. A 6-cycle pipelined wakeup sequence largely mitigates current spikes compared with a single-cycle re-activation scheme, while allowing floating-point unit execution to start one cycle into wakeup. A faster 3-cycle fast-wake option is also supported. Memory arrays, on the other hand, use a regulated active clamped sleep transistor circuit [Fig. 11(b)] that ensures data retention and minimizes standby leakage power [16]. The closed-loop opamp configuration ensures that the virtual ground voltage rises no higher than a programmed reference input voltage under PVT variations; this reference is set based on the memory cell standby voltage. The average sleep transistor area overhead is 5.4% with a 4% frequency penalty. About 90% of the FPU logic and 74% of each PE are sleep-enabled. In addition, forward body bias can be externally applied to all nMOS devices during active mode to increase the operating frequency, and reverse body bias can be applied during idle mode for further leakage savings.

Fig. 11. (a) FPMAC pipelined wakeup diagram and (b) state-retentive memory clamp circuit.
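As a concrete picture of this fine-grained control, the Python fragment below sketches how block-level enables might be derived: tile blocks wake based on the category of the current VLIW operation, and router ports (with their MSINT and links) wake based on traffic. The block names, groupings, and instruction categories here are illustrative assumptions; only the overall scheme, 21 tile sleep regions driven by instruction type and 10 router sleep regions driven by port activity, comes from the text.

```python
# Hypothetical illustration of instruction- and traffic-driven enable generation.

FPU0, FPU1, DMEM, IMEM, RF, RIB = "fpu0", "fpu1", "dmem", "imem", "rf", "rib"

def tile_enables(instr_type: str) -> set:
    """Blocks that must be awake for a given VLIW instruction category."""
    needs = {
        "fpu":   {FPU0, FPU1, RF, IMEM},
        "load":  {DMEM, RF, IMEM},
        "store": {DMEM, RF, IMEM},
        "send":  {RIB, RF, IMEM},
        "recv":  {RIB, DMEM, IMEM},
        "sleep": set(),          # dynamic sleep instruction: everything may gate off
    }
    return needs.get(instr_type, {IMEM})

def router_enables(active_ports: set) -> dict:
    """Per-port clock/sleep enables from observed traffic; idle ports, their MSINT
    and links stay gated."""
    return {port: (port in active_ports) for port in range(5)}

print(tile_enables("load"))    # only the memory/register path wakes up
print(router_enables({0, 3}))  # ports 1, 2, 4 stay gated
```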
IV. EXPERIMENTAL RESULTS

The teraFLOPS processor is fabricated in a 65-nm process technology [12] with a 1.2-nm gate-oxide thickness, nickel salicide for lower resistance, and a second-generation strained silicon technology. The interconnect uses eight copper layers and a low-k carbon-doped oxide inter-layer dielectric. The functional blocks of the chip and individual tile are identified in the die photographs in Fig. 12. The 275 mm² fully custom design contains 100 million transistors. Using a fully-tiled approach, each 3 mm² tile is drawn complete with C4 bumps, power, global clock, and signal routing, and the tiles are seamlessly arrayed by abutment. Each tile contains 1.2 million transistors, with the processing engine accounting for 1 million (83%) and the router 17% of the total tile device count. De-coupling capacitors occupy about 20% of the total logic area. The chip has three independent voltage regions: one for the tiles, a separate supply for the PLL, and a third for the I/O circuits. Test and debug features include a TAP controller and full-scan support for all memory blocks on chip.

Fig. 12. Full-chip and tile micrograph and characteristics.

The evaluation board with the packaged chip is shown in Fig. 13. The die has 8390 C4 solder bumps, arrayed with a single uniform bump pitch across the entire die. The chip-level power distribution consists of a uniform M8-M7 grid aligned with the C4 power and ground bump array. The package is a 66 mm × 66 mm flip-chip LGA (land grid array) and includes an integrated heat spreader. The package has a 14-layer stack-up (5-4-5) to meet the various power-plane and signal requirements and has a total of 1248 pins, of which 343 are signal pins. Decoupling capacitors are mounted on the land side of the package as shown in Fig. 13(b). A PC running custom software is used to apply test vectors and observe results through the on-chip scan chain.

Fig. 13. (a) Package die-side. (b) Package land-side. (c) Evaluation board.

First silicon has been validated to be fully functional. Frequency versus power supply on a typical part is shown in Fig. 14. Silicon chip measurements at a case temperature of 80 °C demonstrate a chip maximum frequency of 1 GHz at 670 mV and 3.16 GHz at 950 mV, with frequency increasing to 5.1 GHz at 1.2 V and 5.67 GHz at 1.35 V. With all 80 tiles actively performing single-precision block-matrix operations, the chip achieves a peak performance of 0.32 TFLOPS (670 mV), 1.0 TFLOPS (950 mV), 1.63 TFLOPS (1.2 V), and 1.81 TFLOPS (1.35 V).

Fig. 14. Measured chip Fmax and peak performance.
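The peak numbers above follow directly from the tile count and per-tile throughput, as the short check below shows: 2 FLOPs per FPMAC per cycle, 2 FPMACs per tile, 80 tiles, evaluated at the measured Fmax points just quoted.

```python
# Arithmetic check of the quoted peak-performance points.

TILES, FPMACS_PER_TILE, FLOPS_PER_FPMAC = 80, 2, 2

def peak_tflops(freq_ghz: float) -> float:
    return TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC * freq_ghz / 1000.0

for vdd, f in [(0.67, 1.0), (0.95, 3.16), (1.2, 5.1), (1.35, 5.67)]:
    print(f"{vdd:.2f} V: {peak_tflops(f):.2f} TFLOPS peak")
# -> 0.32, 1.01, 1.63, 1.81 TFLOPS, matching the measured peak performance figures
```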
Several application kernels have been mapped to the design, and the performance is summarized in Table III. The table shows the single-precision floating-point operation count, the number of active tiles, and the average performance in TFLOPS for each application, reported as a percentage of the peak performance achievable with the design. In each case, task mapping was hand optimized and communication was overlapped with computation as much as possible to increase efficiency. The stencil code solves a steady-state 2-D heat diffusion equation with periodic boundary conditions on the left and right boundaries of a rectilinear grid, and prescribed temperatures on the top and bottom boundaries. For the stencil kernel, chip measurements indicate an average performance of 1.0 TFLOPS at 4.27 GHz and 1.07 V supply with 358K floating-point operations, achieving 73.3% of the peak performance. This result is particularly impressive because the execution is entirely overlapped with local loads and stores and with communication between neighboring tiles. The SGEMM matrix multiplication code operates on two 100 × 100 matrices with 2.63 million floating-point operations, corresponding to an average performance of 0.51 TFLOPS; it is important to note that the read bandwidth from local data memory limits the performance to half the peak rate. The spreadsheet kernel applies reductions to tables of data consisting of pairs of values and weights; for each table, the weighted sum of each row and each column is computed. A 64-point 2-D FFT which implements the Cooley–Tukey algorithm [17] using 64 tiles has also been successfully mapped to the design, with an average performance of 27.3 GFLOPS. It first computes 8-point FFTs in each tile, which in turn passes results to 63 other tiles for the 2-D FFT computation. The complex communication pattern results in high overhead and low efficiency.

TABLE III. APPLICATION PERFORMANCE MEASURED AT 1.07 V AND 4.27 GHz OPERATION

Fig. 15 shows the total chip power dissipation, with the active and leakage power components separated, as a function of frequency and power supply with the case temperature maintained at 80 °C. We report measured power for the stencil application kernel, since it is the most computationally intensive. The chip power consumption ranges from 15.6 W at 670 mV to 230 W at 1.35 V. With all 80 tiles actively executing stencil code, the chip achieves 1.0 TFLOPS of average performance at 4.27 GHz and 1.07 V supply with a total power dissipation of 97 W. The total power consumed increases to 230 W at 1.35 V and 5.67 GHz operation, delivering 1.33 TFLOPS of average performance.

Fig. 15. Stencil application performance measured at 1.07 V and 4.27 GHz operation.

Fig. 16 plots the measured energy efficiency in GFLOPS/W for the stencil application with power supply and frequency scaling. As expected, the chip energy efficiency increases with power supply reduction, from 5.8 GFLOPS/W at 1.35 V supply, to 10.5 GFLOPS/W at the 1.0 TFLOPS goal, to a maximum of 19.4 GFLOPS/W at 750 mV supply. Below 750 mV, the maximum operating frequency degrades faster than the power saved by lowering the tile supply voltage, resulting in overall performance reduction and a consequent drop in processor energy efficiency. The chip provides up to 394 GFLOPS of aggregate performance at 750 mV with a measured total power dissipation of just 20 W.

Fig. 16. Measured chip energy efficiency for stencil application.

Fig. 17 presents the estimated power breakdown at the tile and router levels, simulated at 4 GHz, 1.2 V supply, and 110 °C. The processing engine with the dual FPMACs, instruction and data memory, and the register file accounts for 61% of the total tile power [Fig. 17(a)]. The communication power is significant at 28% of the tile power, and the synchronous tile-level clock distribution accounts for 11% of the total. Fig. 17(b) shows a more detailed tile-to-tile communication power breakdown, which includes the router, mesochronous interfaces, and links. Clocking power is the largest component, accounting for 33% of the communication power. The input queues on both lanes and the data path circuits are the second major component, dissipating 22% of the communication power.

Fig. 17. Estimated (a) tile power profile and (b) communication power breakdown.
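The energy-efficiency points quoted above follow from the measured performance and power pairs, as the short check below shows (the small residual differences come from rounding of the quoted power numbers).

```python
# Worked check of GFLOPS/W = average performance / total chip power for the stencil kernel.

points = [  # (supply V, average TFLOPS, chip power W) -- values from the text
    (1.35, 1.33, 230.0),
    (1.07, 1.00, 97.0),
    (0.75, 0.394, 20.0),
]
for vdd, tflops, watts in points:
    print(f"{vdd:.2f} V: {1000 * tflops / watts:.1f} GFLOPS/W")
# -> ~5.8, ~10.3 and ~19.7 GFLOPS/W, consistent with the reported 5.8 / 10.5 / 19.4
```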
Fig. 18 shows the output differential and single-ended clock waveforms measured at the farthest clock buffer from the phase-locked loop (PLL) at a frequency of 5 GHz. Notice that the duty cycles of the clocks are close to 50%. Fig. 19 plots the global clock distribution power as a function of frequency and power supply; this is the switching power dissipated in the clock spines from the PLL to the opamp at the center of each of the 80 tiles. Measured silicon data at 80 °C show that this power is 80 mW at 0.8 V and 1.7 GHz, increasing by 10X to 800 mW at 1 V and 3.8 GHz. The global clock distribution power is 2 W at 1.2 V and 5.1 GHz and accounts for just 1.3% of the total chip power.

Fig. 18. Measured global clock distribution waveforms.

Fig. 19. Measured global clock distribution power.

Fig. 20 plots the chip leakage power as a percentage of the total power with all 80 processing engines and routers awake and with all the clocks disabled. Measurements show that the worst-case leakage power in active mode varies from a minimum of 9.6% to a maximum of 15.7% of the total power when measured over the power supply range of 670 mV to 1.35 V. In sleep mode, the nMOS sleep transistors are turned off, reducing chip leakage by 2X while preserving the logic state in all memory arrays. Fig. 21 shows the active and leakage power reduction due to a combination of the selective router port activation, clock gating, and sleep transistor techniques described in Section III. Measured at 1.2 V, 80 °C, and 5.1 GHz operation, the total network power per tile can be lowered from a maximum of 924 mW with all router ports active to 126 mW, resulting in a 7.3X reduction. The network leakage power per tile with all ports and the global clock buffers feeding the router disabled is 126 mW. This number includes the power dissipated in the router, MSINT, and the links.

Fig. 20. Measured chip leakage power as a percentage of total power versus Vcc. A 2X reduction is obtained by turning off sleep transistors.

Fig. 21. On-die network power reduction benefit.
V. CONCLUSION

In this paper, we have presented an 80-tile high-performance NoC architecture implemented in a 65-nm process technology. The prototype contains 160 lower-latency FPMAC cores and features a single-cycle accumulator architecture for high throughput. Each tile also contains a fast and compact router operating at core speed, and the 80 tiles are interconnected using a 2-D mesh topology providing a high bisection bandwidth of over 2 Terabits/s. The design uses a combination of micro-architecture, logic, circuits, and a 65-nm process to reach the target performance. Silicon operates over a wide voltage and frequency range and delivers teraFLOPS performance with high power efficiency. For the most computationally intensive application kernel, the chip achieves an average performance of 1.0 TFLOPS while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5 GFLOPS/W. Average performance scales to 1.33 TFLOPS at a maximum operational frequency of 5.67 GHz and 1.35 V supply. These results demonstrate the feasibility of high-performance and energy-efficient building blocks for peta-scale computing in the near future.

ACKNOWLEDGMENT

The authors would like to thank V. De, Prof. A. Alvandpour, D. Somasekhar, D. Jenkins, P. Aseron, J. Collias, B. Nefcy, P. Iyer, S. Venkataraman, S. Saha, M. Haycock, J. Schutz, and J. Rattner for help, encouragement, and support; T. Mattson, R. Wijngaart, and M. Frumkin from the SSG and ARL teams at Intel for assistance with mapping the kernels to the design; the LTD and ATD teams for PLL and package design and assembly; the entire mask design team for chip layout; and the reviewers for their useful remarks.

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
[2] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. 38th Design Automation Conf., Jun. 2001, pp. 681–689.
[3] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, Mar.–Apr. 2002.
[4] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," in Proc. 30th Annu. Int. Symp. Computer Architecture, 2003, pp. 422–433.
[5] Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, "An asynchronous array of simple processors for DSP applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2006, pp. 428–429.
[6] H. Wang, L. Peh, and S. Malik, "Power-driven design of router microarchitectures in on-chip networks," in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), 2003, pp. 105–116.
[7] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 98–99.
[8] S. Vangal, Y. Hoskote, N. Borkar, and A. Alvandpour, "A 6.2-GFLOPS floating-point multiply-accumulator with conditional normalization," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2314–2323, Oct. 2006.
[9] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std. 754-1985, IEEE Standards Board, New York, 1985.
[10] S. Vangal, N. Borkar, and A. Alvandpour, "A six-port 57 GB/s double-pumped non-blocking router core," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2005, pp. 268–269.
[11] H. Wilson and M. Haycock, "A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1954–1963, Dec. 2001.
[12] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang, and M. Bohr, "A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 μm² SRAM cell," in IEDM Tech. Dig., Dec. 2004, pp. 657–660.
[13] F. Klass, "Semi-dynamic and dynamic flip-flops with embedded logic," in Symp. VLSI Circuits Dig. Tech. Papers, 1998, pp. 108–109.
[14] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, 2001, pp. 147–151.
[15] J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar, and V. De, "Dynamic sleep transistor and body bias for active leakage power control of microprocessors," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1838–1845, Nov. 2003.
[16] M. Khellah, D. Somasekhar, Y. Ye, N. Kim, J. Howard, G. Ruhl, M. Sunna, J. Tschanz, N. Borkar, F. Hamzaoglu, G. Pandya, A. Farhang, K. Zhang, and V. De, "A 256-Kb dual-Vcc SRAM building block in 65-nm CMOS process with actively clamped sleep transistor," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 233–242, Jan. 2007.
[17] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.
[18] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour, "A 5.1 GHz 0.34 mm² router for network-on-chip applications," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 42–43.
Sriram R. Vangal (S'90–M'98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, and the Ph.D. degree from Linköping University, Sweden, all in electrical engineering. His Ph.D. research focused on energy-efficient network-on-chip (NoC) designs. With Intel since 1995, he is currently a Senior Research Scientist at the Microprocessor Technology Labs, Hillsboro, OR. He was the technical lead for the advanced prototype team that designed the industry's first single-chip Teraflops processor. His research interests are in the area of low-power high-performance circuits, power-aware computing, and NoC architectures. He has published 16 papers and has 16 issued patents with 7 pending in these areas.

Jason Howard received the M.S.E.E. degree from Brigham Young University, Provo, UT, in 2000. He was an Intern with Intel Corporation during the summers of 1998 and 1999, working on the Pentium 4 microprocessor. In 2000, he formally joined Intel, working in the Oregon Rotation Engineers Program. After two successful rotations through NCG and CRL, he officially joined Intel Laboratories, Hillsboro, OR, in 2001. He is currently working for the CRL Prototype Design team. He has co-authored several papers and has patents pending.

Gregory Ruhl (M'07) received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1998 and 1999, respectively. He joined Intel Corporation, Hillsboro, OR, in 1999 as part of the Rotation Engineering Program, where he worked on the PCI-X I/O switch, Gigabit Ethernet validation, and individual circuit research projects. After completing the REP program, he joined Intel's Circuits Research Lab, where he has been working on design, research, and validation on a variety of topics ranging from SRAMs and signaling to terascale computing.

Saurabh Dighe (M'05) received the B.E. degree in electronics engineering from the University of Mumbai, India, in 2001, and the M.S. degree in computer engineering from the University of Minnesota, Minneapolis, in 2003. His work focused on computer architecture and the design, modeling, and simulation of digital systems. He was with Intel Corporation, Santa Clara, CA, working on front-end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently, he is with Intel Corporation's Circuit Research Labs, Microprocessor Technology Labs, Hillsboro, OR, and is a member of the prototype design team involved in the definition, implementation, and validation of future terascale computing technologies.

Howard Wilson was born in Chicago, IL, in 1957. He received the B.S. degree in electrical engineering from Southern Illinois University, Carbondale, in 1979. From 1979 to 1984 he worked at Rockwell-Collins, Cedar Rapids, IA, where he designed navigation and electronic flight display systems. From 1984 to 1991, he worked at National Semiconductor, Santa Clara, CA, designing telecom components for ISDN. With Intel since 1992, he is currently a member of the Circuits Research Laboratory, Hillsboro, OR, engaged in a variety of advanced prototype design activities.

James Tschanz (M'97) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1997 and 1999, respectively. Since 1999, he has been a Circuits Researcher with the Intel Circuit Research Lab, Hillsboro, OR. His research interests include low-power digital circuits, design techniques, and methods for tolerating parameter variations. He is also an Adjunct Faculty Member at the Oregon Graduate Institute, Beaverton, OR, where he teaches digital VLSI design.

David Finan received the A.S. degree in electronic engineering technology from Portland Community College, Portland, OR, in 1989. He joined Intel Corporation, Hillsboro, OR, working on the iWarp project in 1988. He started working on the Intel i486DX2 project in 1990, the Intel Teraflop project in 1993, for Intel Design Laboratories in 1995, for Intel Advanced Methodology and Engineering in 1996, on the Intel Timna project in 1998, and for Intel Laboratories in 2000.

Arvind Singh (M'05) received the B.Tech. degree in electronics and communication engineering from the Institute of Engineering and Technology, Lucknow, India, in 2000, and the M.Tech. degree in VLSI design from the Indian Institute of Technology, Delhi, India, in 2001. He holds the University Gold Medal for his B.Tech. degree. He is a Research Scientist at the Microprocessor Technology Labs, Intel Bangalore, India, working on scalable on-die interconnect fabric for terascale research and prototyping. His research interests include network-on-chip technologies and energy-efficient building blocks. Prior to Intel, he worked on high-performance network search engines and low-power SRAMs at Cypress Semiconductors. Mr. Singh is a member of the IEEE and TiE.

Tiju Jacob received the B.Tech. degree in electronics and communication engineering from NIT, Calicut, India, in 2000, and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, India, in 2003. He joined Intel Corporation, Bangalore, India, in February 2003 and is currently a member of the Circuit Research Labs. His areas of interest include low-power high-performance circuits, many-core processors, and advanced prototyping.

Shailendra Jain received the B.E. degree in electronics engineering from the Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in electrical engineering from IIT Madras, India, in 2001. Since 2004, he has been with the Intel Bangalore Design Lab, India. His work includes research and advanced prototyping in the areas of low-power high-performance digital circuits and physical design of multi-million-transistor chips. He has coauthored two papers in these areas.

Vasantha Erraguntla (M'92) received the Bachelors degree in electrical engineering from Osmania University, India, and the Masters degree in computer engineering from the University of Louisiana, Lafayette. She joined Intel in Hillsboro, OR, in 1991 and worked on the high-speed router technology for the Intel Teraflop machine (ASCI Red). For over 10 years, she was engaged in a variety of advanced prototype design activities at Intel Laboratories, implementing and validating research ideas in the areas of high-performance and low-power circuits and high-speed signaling. She relocated to India in June 2004 to start up the Bangalore Design Lab to help facilitate circuit research and prototype development. She has co-authored seven papers and has two patents issued and four pending.

Clark Roberts received the A.S.E.E.T. degree from Mt. Hood Community College, Gresham, OR, in 1985. While pursuing his education, he held several technical positions at Tektronix Inc., Beaverton, OR, from 1978 to 1992. In 1992, he joined Intel Corporation's Supercomputer Systems Division, where he worked on the research and design of clock distribution systems for massively parallel supercomputers. He worked on the system clock design for the ASCI Red TeraFlop Supercomputer (Intel, DOE, and Sandia). From 1996 to 2000, he worked as a Signal Integrity Engineer in Intel Corporation's Microprocessor Division, focusing on clock distribution for Intel microprocessor package and motherboard designs. In 2001, he joined the circuit research area of Intel Labs. He has been working on physical prototype design and measurement systems for high-speed I/O and terascale microprocessor research.

Yatin Hoskote (M'96) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, and the M.S. and Ph.D. degrees in computer engineering from the University of Texas at Austin. He joined Intel in 1995 as a member of Strategic CAD Labs doing research in verification technologies. He is currently a Principal Engineer in the Advanced Prototype design team in Intel's Microprocessor Technology Lab, working on next-generation network-on-chip technologies. Dr. Hoskote received the Best Paper Award at the Design Automation Conference in 1999 and an Intel Achievement Award in 2006. He is Chair of the program committee for the 2007 High Level Design Validation and Test Workshop and a guest editor for the IEEE Design & Test Special Issue on Multicore Interconnects.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985. He joined Intel Corporation in Portland, OR, in 1986, where he worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Circuit Research Laboratory, developing novel technologies in the high-performance low-power circuit areas and applying them toward terascale computing research.

Shekhar Borkar (M'97) was born in Mumbai, India. He received the B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981. He is an Intel Fellow and Director of the Circuit Research Laboratories at Intel Corporation, Hillsboro, OR. He joined Intel in 1981, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. He is an adjunct member of the faculty of the Oregon Graduate Institute, Beaverton. He has published 10 articles and holds 11 patents.