An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS
Sriram R. Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Arvind Singh, Tiju Jacob, Shailendra Jain, Vasantha Erraguntla, Clark Roberts, Yatin Hoskote, Nitin Borkar, and Shekhar Borkar

Abstract—This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8 × 10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.

Index Terms—CMOS digital integrated circuits, crossbar router and network-on-chip (NoC), floating-point unit, interconnection, leakage reduction, MAC, multiply-accumulate.

Fig. 1. NoC architecture.

I. INTRODUCTION

The scaling of MOS transistors into the nanometer regime opens the possibility of creating large scalable Network-on-Chip (NoC) architectures [1] containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. The basic concept is to replace today's shared buses with on-chip packet-switched interconnection networks [2]. NoC architectures use layered protocols and packet-switched networks which consist of on-chip routers, links, and well-defined network interfaces. As shown in Fig. 1, the basic building block of the NoC architecture is the "network tile". The tiles are connected to an on-chip network that routes packets between them. Each tile may consist of one or more compute cores and includes logic responsible for routing and forwarding packets based on the routing policy of the network. The structured network wiring of such a NoC design gives well-controlled electrical parameters that simplify timing and allow the use of high-performance circuits to reduce latency and increase bandwidth. Recent tile-based chip multiprocessors include the RAW [3], TRIPS [4], and ASAP [5] projects. These tiled architectures show promise for greater integration, high performance, good scalability, and potentially high energy efficiency.

With the increasing demand for interconnect bandwidth, on-chip networks are taking up a substantial portion of the system power budget. The 16-tile MIT RAW on-chip network consumes 36% of total chip power, with each router dissipating 40% of individual tile power [6]. The routers and links of the Alpha 21364 microprocessor consume about 20% of the total chip power. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. At the same time, while applications dictate the choice of the compute core, the advent of multimedia applications such as three-dimensional (3-D) graphics and signal processing places stronger demands on self-contained, low-latency floating-point processors with increased throughput. A computational fabric built using these optimized building blocks is expected to provide high levels of performance in an energy-efficient manner. This paper describes the design details of an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of average performance while dissipating less than 100 W.

The remainder of the paper is organized as follows. Section II gives an architectural overview of the 80-tile NoC and describes the key building blocks; it also explains the FPMAC unit pipeline and the design optimizations used to accomplish single-cycle accumulation, along with the router architecture, NoC communication protocol, and packet formats. Section III describes chip implementation details, including the high-speed mesochronous clock distribution network used in this design and the circuits used for leakage power management in both logic and memory blocks. Section IV presents the chip measurement results. Section V concludes by summarizing the NoC architecture along with key performance and power numbers.

Manuscript received April 16, 2007; revised September 27, 2007. S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar are with the Microprocessor Technology Laboratories, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: sriram.r.vangal@intel.com). A. Singh, T. Jacob, S. Jain, and V. Erraguntla are with the Microprocessor Technology Laboratories, Intel Corporation, Bangalore 560037, India. Digital Object Identifier 10.1109/JSSC.2007.910957
Fig. 2. NoC block diagram and tile architecture.

II. NOC ARCHITECTURE

The NoC architecture (Fig. 2) contains 80 tiles arranged as an 8 × 10 2-D mesh network that is designed to operate at 4 GHz [7]. Each tile consists of a processing engine (PE) connected to a 5-port router with mesochronous interfaces (MSINT), which forwards packets between the tiles. The 80-tile on-chip network enables a bisection bandwidth of 2 Terabits/s. The PE contains two independent fully-pipelined single-precision floating-point multiply-accumulator (FPMAC) units, a 3 KB single-cycle instruction memory (IMEM), and 2 KB of data memory (DMEM). A 96-bit Very Long Instruction Word (VLIW) encodes up to eight operations per cycle. With a 10-port (6-read, 4-write) register file, the architecture allows scheduling to both FPMACs, simultaneous DMEM loads and stores, packet send/receive from the mesh network, program control, and dynamic sleep instructions. A router interface block (RIB) handles packet encapsulation between the PE and router. The fully symmetric architecture allows any PE to send (receive) instruction and data packets to (from) any other tile. The 15 fan-out-of-4 (FO4) design uses a balanced core and router pipeline, with critical stages employing performance-setting semi-dynamic flip-flops. In addition, a scalable low-power mesochronous clock distribution is employed in a 65-nm eight-metal CMOS process, enabling high integration and single-chip realization of the teraFLOPS processor.

A. FPMAC Architecture

The nine-stage pipelined FPMAC architecture (Fig. 3) uses a single-cycle accumulate algorithm [8] with base-32 and internal carry-save arithmetic with delayed addition. The FPMAC contains a fully pipelined multiplier unit and a single-cycle accumulation loop, followed by pipelined addition and normalization units. Operands A and B are 32-bit inputs in IEEE-754 single-precision format [9]. The design is capable of sustained pipelined performance of one FPMAC instruction every 250 ps. The multiplier is designed using a Wallace tree of 4-2 carry-save adders. The well-matched delays of each Wallace tree stage allow for highly efficient pipelining. Four Wallace tree stages are used to compress the partial product bits to a sum and carry pair. Notice that the multiplier does not use a carry-propagate adder at the final stage. Instead, the multiplier retains the output in carry-save format and converts the result to base 32 prior to accumulation. In an effort to achieve fast single-cycle accumulation, we first analyzed each of the critical operations involved in conventional FPUs with the intent of eliminating, reducing, or deferring the logic operations inside the accumulate loop, and identified the following three optimizations [8].

1) The accumulator retains the multiplier output in carry-save format and uses an array of 4-2 carry-save adders to "accumulate" the result in an intermediate format. This removes the need for a carry-propagate adder in the critical path.

2) Accumulation is performed in a base-32 system, converting the expensive variable shifters in the accumulate loop to constant shifters.

3) The costly normalization step is moved outside the accumulate loop, where the accumulation result in carry-save format is added, and the sum is normalized and converted back to base 2.

These optimizations allow accumulation to be implemented in just 15 FO4 stages. This approach also reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle.
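To make the loop structure concrete, the short Python sketch below models the three optimizations behaviorally: the running total is kept as an unresolved (sum, carry) pair, alignment shifts are multiples of 5 bits because exponents are tracked in base 32, and the carry-propagate add plus normalization happen only after the loop. This is an editorial illustration rather than the FPMAC datapath: it uses a 3-2 carry-save step where the hardware uses 4-2 compressors, plain integers for mantissas, and a precomputed common exponent.

```python
# Behavioral sketch (not the chip RTL) of the single-cycle accumulate idea:
# (1) carry-save accumulation with no carry-propagate adder in the loop,
# (2) base-32 exponent handling so alignment shifts come in 5-bit steps,
# (3) deferred carry-propagate add and normalization after the loop.

DIGIT = 5  # base 32 = 2**5

def csa(a, b, c):
    # One carry-save compression step: three terms in, (sum, carry) out,
    # with no carry propagation across bit positions.
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def accumulate(products):
    """products: list of (mantissa, exp32) pairs, exp32 in base-32 units."""
    max_e = max(e for _, e in products)            # common base-32 exponent
    acc_s, acc_c = 0, 0
    for m, e in products:
        aligned = m >> ((max_e - e) * DIGIT)       # shift granularity is 5 bits
        acc_s, acc_c = csa(acc_s, acc_c, aligned)  # one "cycle" of the loop
    # Deferred work, outside the loop: carry-propagate add, then convert back
    # to an ordinary base-2 value (normalization stand-in).
    return (acc_s + acc_c) * 2.0 ** (max_e * DIGIT)

print(accumulate([(3, 0)] * 4))  # 3 accumulated four times -> 12.0
```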
Careful pipeline re-balancing allows removal of three pipe stages, resulting in a 25% latency improvement over the work in [8]. The dual FPMACs in each PE provide 16 GFLOPS of aggregate performance and are critical to achieving the goal of teraFLOPS performance.
Fig. 3. FPMAC nine-stage pipeline with single-cycle accumulate loop.

B. Instruction Set

The architecture defines a 96-bit VLIW which allows a maximum of eight operations to be issued every cycle. The instructions fall into one of six categories (Table I): instruction issue to both floating-point units, simultaneous data memory loads and stores, packet send/receive via the on-die mesh network, program control using jump and branch instructions, synchronization primitives for data transfer between PEs, and dynamic sleep instructions. The data path between the DMEM and the register file supports transfer of two 32-bit data words per cycle on each load (or store) instruction. The register file issues four 32-bit data words to the dual FPMACs per cycle, while retiring two 32-bit results every cycle. The synchronization instructions aid with data transfer between tiles and allow the PE to stall while waiting for data (WFD) to arrive. To aid with power management, the architecture provides special instructions for dynamic sleep and wakeup of each PE, including independent sleep control of each floating-point unit inside the PE. The architecture allows any PE to issue sleep packets to any other tile or wake it up for processing tasks. With the exception of FPU instructions, which have a pipelined latency of nine cycles, most instructions execute in 1–2 cycles.

TABLE I. INSTRUCTION TYPES AND LATENCY

C. NoC Packet Format

Fig. 4 describes the NoC packet structure and routing protocol. The on-chip 2-D mesh topology utilizes a 5-port router based on wormhole switching, where each packet is subdivided into "FLITs" or "Flow control unITs". Each FLIT contains six control signals and 32 data bits. The packet header (FLIT_0) allows for a flexible source-directed routing scheme, where a 3-bit destination ID field (DID) specifies the router exit port; this field is updated at each hop. Flow control and buffer management between routers is debit-based using almost-full bits, which the receiver queue signals via two flow control bits when its buffers reach a specified threshold. Each header FLIT supports a maximum of 10 hops. A chained header (CH) bit in the packet provides support for a larger number of hops. Processing engine control information, including sleep and wakeup control bits, is specified in FLIT_1, which follows the header FLIT. The minimum packet size required by the protocol is two FLITs. The router architecture places no restriction on the maximum packet size.

Fig. 4. NoC protocol: packet format and FLIT description.
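To make the header layout concrete, the Python sketch below packs a source-directed route into FLIT_0 and peels off one 3-bit DID per hop. Only the field widths stated above (32 data bits plus 6 control bits per FLIT, a 3-bit DID, up to 10 hops, a CH bit, and PE control in FLIT_1) come from the text; the exact bit positions, the control-bit encoding, and the shift-based DID update are assumptions made for illustration.

```python
# Illustrative FLIT/header packing -- a sketch, not the chip's exact bit layout.

from dataclasses import dataclass
from typing import List

@dataclass
class Flit:
    control: int  # 6 control bits (lane ID, valid, header/tail framing, ...)
    data: int     # 32 data bits

def make_header(route: List[int], chained: bool = False) -> Flit:
    """Pack a source-directed route (one 3-bit exit port per hop, max 10 hops)."""
    assert len(route) <= 10
    data = 0
    for hop, port in enumerate(route):
        assert 0 <= port < 8                  # 3-bit destination ID (DID) per hop
        data |= port << (3 * hop)
    data |= int(chained) << 30                # assumed position of the CH bit
    return Flit(control=0b000001, data=data)  # assumed "header" control encoding

def next_hop(header: Flit) -> int:
    """Each router reads its 3-bit exit port, then updates (shifts) the DID field."""
    port = header.data & 0b111
    header.data >>= 3
    return port

# Minimum legal packet: header FLIT_0 plus one FLIT_1 carrying PE control/data.
packet = [make_header([2, 2, 1]), Flit(control=0b000010, data=0x0000_00FF)]
print(next_hop(packet[0]))  # first router exits on port 2
```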
D. Router Architecture

A 4 GHz five-port two-lane pipelined packet-switched router core (Fig. 5) with phase-tolerant mesochronous links forms the key communication fabric for the 80-tile NoC architecture. Each port has two 39-bit unidirectional point-to-point links. The input-buffered wormhole-switched router [18] uses two logical lanes (lanes 0–1) for deadlock-free routing and a fully non-blocking crossbar switch with a total bandwidth of 80 GB/s (32 data bits × 4 GHz × 5 ports). Each lane has a 16-FLIT queue, an arbiter, and flow control logic. The router uses a 5-stage pipeline with a two-stage round-robin arbitration scheme that first binds an input port to an output port in each lane and then selects a pending FLIT from one of the two lanes. A shared data path architecture allows crossbar switch re-use across both lanes on a per-FLIT basis. The router links implement a mesochronous interface with first-in-first-out (FIFO) based synchronization at the receiver.

Fig. 5. Five-port two-lane shared crossbar router architecture.

The router core features a double-pumped crossbar switch [10] to mitigate crossbar interconnect routing area. The schematic in Fig. 6(a) shows the 36-bit crossbar data bus double-pumped at the fourth pipe-stage of the router by interleaving alternate data bits using dual edge-triggered flip-flops, reducing crossbar area by 50%. In addition, the proposed router architecture shares the crossbar switch across both lanes on an individual FLIT basis. The combined application of both ideas enables a compact 0.34 mm² design, resulting in a 34% reduction in router layout area as shown in Fig. 6(b), 26% fewer devices, a 13% improvement in average power, and a one-cycle latency reduction (from 6 to 5 cycles) over the router design in [11] when ported and compared in the same 65-nm process [12]. Results from the comparison are summarized in Table II.

Fig. 6. (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11].

TABLE II. ROUTER COMPARISON OVER WORK IN [11]
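The two-stage arbitration described above can be sketched behaviorally as follows. The Python model is a simplification: it re-arbitrates every cycle, whereas a wormhole router holds an input-to-output binding for the duration of a packet, and the class and signal names are invented for illustration. Only the structure, a per-lane round-robin binding of input ports to output ports followed by a per-output lane select, follows the text.

```python
# Hypothetical model of two-stage round-robin arbitration for a 5-port, 2-lane router.

PORTS, LANES = 5, 2

class RoundRobin:
    """Rotating-priority pick among requesters."""
    def __init__(self, n):
        self.n, self.last = n, n - 1
    def pick(self, requests):
        for i in range(1, self.n + 1):
            cand = (self.last + i) % self.n
            if requests[cand]:
                self.last = cand
                return cand
        return None

class TwoStageArbiter:
    def __init__(self):
        # one port arbiter per (lane, output port), one lane arbiter per output port
        self.port_arb = [[RoundRobin(PORTS) for _ in range(PORTS)] for _ in range(LANES)]
        self.lane_arb = [RoundRobin(LANES) for _ in range(PORTS)]

    def arbitrate(self, requests):
        """requests[lane][in_port] = desired output port (or None).
        Returns grants[out_port] = (lane, in_port) for this cycle."""
        grants = {}
        for out in range(PORTS):
            # Stage 1: per lane, choose one input port requesting this output.
            lane_winner = [None] * LANES
            for lane in range(LANES):
                wants = [requests[lane][p] == out for p in range(PORTS)]
                win = self.port_arb[lane][out].pick(wants)
                if win is not None:
                    lane_winner[lane] = win
            # Stage 2: choose between the two lanes for this output port.
            lane = self.lane_arb[out].pick([w is not None for w in lane_winner])
            if lane is not None:
                grants[out] = (lane, lane_winner[lane])
        return grants

# Both lanes' input port 0 want output port 3; grants alternate lanes over cycles.
arb = TwoStageArbiter()
reqs = [[3, None, None, None, None], [3, None, None, None, None]]
print(arb.arbitrate(reqs))  # {3: (0, 0)}
print(arb.arbitrate(reqs))  # {3: (1, 0)}
```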
E. Mesochronous Communication

The 2-mm-long point-to-point unidirectional router links implement a phase-tolerant mesochronous interface (Fig. 7). Four of the five router links are source synchronous, each providing a strobe (Tx_clk) with 38 bits of data. To reduce active power, Tx_clk is driven at half the clock rate. A 4-deep circular FIFO built using transparent latches captures data on both edges of the delayed link strobe at the receiver. The strobe delay and duty-cycle can be digitally programmed using the on-chip scan chain. A synchronizer circuit sets the latency between the FIFO write and read pointers to 1 or 2 cycles at each port, depending on the phase of the arriving strobe with respect to the local clock. A more aggressive low-latency setting reduces the synchronization penalty by one cycle. The interface includes the first stage of the router pipeline.

Fig. 7. Phase-tolerant mesochronous interface and timing diagram.

F. Router Interface Block (RIB)

The RIB is responsible for message-passing and aids with synchronization of data transfer between the tiles and with power management at the PE level. Incoming 38-bit-wide FLITs are buffered in a 16-entry queue, where demultiplexing based on the lane ID and framing to 64 bits for data packets (DMEM) and 96 bits for instruction packets (IMEM) are accomplished. The buffering is required during program execution, since DMEM stores from the 10-port register file have priority over data packets received by the RIB. The unit decodes FLIT_1 (Fig. 4) of an incoming instruction packet and generates several PE control signals. This allows the PE to start execution (REN) at a specified IMEM address (PCA) and is enabled by the new program counter (NPC) bit. After receiving a full data packet, the RIB generates a break signal to continue execution if the IMEM is in a stalled (WFD) mode. Upon receipt of a sleep packet via the mesh network, the RIB unit can also dynamically put the entire PE to sleep or wake it up for processing tasks on demand.

III. DESIGN DETAILS

To allow 4 GHz operation, the entire core is designed using hand-optimized data path macros. CMOS static gates are used to implement most of the logic. However, critical registers in the FPMAC and router logic utilize implicit-pulsed semi-dynamic flip-flops (SDFF) [13], [14]. The SDFF (Fig. 8) has a dynamic master stage coupled to a pseudo-static slave stage. The FPMAC accumulator register is built using data-inverting rising edge-triggered SDFFs with synchronous reset and enable. The negative setup time of the flip-flop is taken advantage of in the critical path. When compared to a conventional static master–slave flip-flop, the SDFF provides both shorter latency and the capability of incorporating logic functions with minimum delay penalty, properties which are desirable in high-performance digital designs.

Fig. 8. Semi-dynamic flip-flop (SDFF) schematic.

The chip uses a scalable global mesochronous clocking technique, which allows clock phase-insensitive communication across tiles and synchronous operation within each tile. The on-chip phase-locked loop (PLL) output [Fig. 9(a)] is routed using horizontal M8 and vertical M7 spines. Each spine consists of differential clocks for low duty-cycle variation along the worst-case clock route of 26 mm. An opamp at each tile converts the differential clocks to a single-ended clock with a 50% duty cycle prior to distributing the clock across the tile using a balanced H-tree. This clock distribution scales well as tiles are added or removed. The worst-case simulated global duty-cycle variation is 3 ps, and the local clock skew within the tile is 4 ps. Fig. 9(b) shows simulated clock arrival times for all 80 tiles at 4 GHz operation. Note that multiple cycles are required for the global clock to propagate to all 80 tiles. The systematic clock skews inherent in the distribution help spread peak currents due to simultaneous clock switching over the entire cycle.

Fig. 9. (a) Global mesochronous clocking and (b) simulated clock arrival times.

Fine-grained clock gating [Fig. 9(a)], sleep transistors, and body bias circuits [15] are used to reduce active and standby leakage power; they are controlled at full-chip, tile-slice, and individual tile levels based on workload. Each tile is partitioned into 21 smaller sleep regions with dynamic control of individual blocks in the PE and router units based on instruction type. The router is partitioned into 10 smaller sleep regions with control of individual router ports, depending on network traffic patterns. The design uses nMOS sleep transistors to reduce frequency penalty and area overhead. Fig. 10 shows the router and on-die network power management scheme. The enable signals gate the clock to each port, the MSINT, and the links. In addition, the enable signals also activate the nMOS sleep transistors in the input queue arrays of both lanes. The 360 μm nMOS sleep device in the register file is sized to provide a 4.3X reduction in array leakage power with a 4% frequency impact. The global clock buffer feeding the router is finally gated at the tile level based on port activity.

Fig. 10. Router and on-die network power management.

Each FPMAC implements unregulated sleep transistors with no data retention [Fig. 11(a)]. A 6-cycle pipelined wakeup sequence largely mitigates current spikes compared with a single-cycle re-activation scheme, while allowing floating-point unit execution to start one cycle into wakeup. A faster 3-cycle fast-wake option is also supported. Memory arrays, on the other hand, use a regulated active clamped sleep transistor circuit [Fig. 11(b)] that ensures data retention and minimizes standby leakage power [16]. The closed-loop opamp configuration ensures that the virtual ground voltage rises no higher than a programmed reference input voltage under PVT variations; this reference is set based on the memory cell standby voltage. The average sleep transistor area overhead is 5.4% with a 4% frequency penalty. About 90% of the FPU logic and 74% of each PE are sleep-enabled. In addition, forward body bias can be externally applied to all nMOS devices during active mode to increase the operating frequency, and reverse body bias can be applied during idle mode for further leakage savings.

Fig. 11. (a) FPMAC pipelined wakeup diagram and (b) state-retentive memory clamp circuit.
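As a concrete picture of this fine-grained control, the Python fragment below sketches how block-level enables might be derived: tile blocks wake based on the category of the current VLIW operation, and router ports (with their MSINT and links) wake based on traffic. The block names, groupings, and instruction categories here are illustrative assumptions; only the overall scheme, 21 tile sleep regions driven by instruction type and 10 router sleep regions driven by port activity, comes from the text.

```python
# Hypothetical illustration of instruction- and traffic-driven enable generation.

FPU0, FPU1, DMEM, IMEM, RF, RIB = "fpu0", "fpu1", "dmem", "imem", "rf", "rib"

def tile_enables(instr_type: str) -> set:
    """Blocks that must be awake for a given VLIW instruction category."""
    needs = {
        "fpu":   {FPU0, FPU1, RF, IMEM},
        "load":  {DMEM, RF, IMEM},
        "store": {DMEM, RF, IMEM},
        "send":  {RIB, RF, IMEM},
        "recv":  {RIB, DMEM, IMEM},
        "sleep": set(),          # dynamic sleep instruction: everything may gate off
    }
    return needs.get(instr_type, {IMEM})

def router_enables(active_ports: set) -> dict:
    """Per-port clock/sleep enables from observed traffic; idle ports, their MSINT
    and links stay gated."""
    return {port: (port in active_ports) for port in range(5)}

print(tile_enables("load"))    # only the memory/register path wakes up
print(router_enables({0, 3}))  # ports 1, 2, 4 stay gated
```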
IV. EXPERIMENTAL RESULTS

The teraFLOPS processor is fabricated in a 65-nm process technology [12] with a 1.2-nm gate-oxide thickness, nickel salicide for lower resistance, and a second-generation strained silicon technology. The interconnect uses eight copper layers and a low-k carbon-doped oxide inter-layer dielectric. The functional blocks of the chip and individual tile are identified in the die photographs in Fig. 12. The 275 mm² fully custom design contains 100 million transistors. Using a fully-tiled approach, each 3 mm² tile is drawn complete with C4 bumps, power, global clock, and signal routing, and the tiles are seamlessly arrayed by abutment. Each tile contains 1.2 million transistors, with the processing engine accounting for 1 million (83%) and the router 17% of the total tile device count. De-coupling capacitors occupy about 20% of the total logic area. The chip has three independent voltage regions: one for the tiles, a separate supply for the PLL, and a third for the I/O circuits. Test and debug features include a TAP controller and full-scan support for all memory blocks on chip.

Fig. 12. Full-chip and tile micrograph and characteristics.

The evaluation board with the packaged chip is shown in Fig. 13. The die has 8390 C4 solder bumps, arrayed with a single uniform bump pitch across the entire die. The chip-level power distribution consists of a uniform M8-M7 grid aligned with the C4 power and ground bump array. The package is a 66 mm × 66 mm flip-chip LGA (land grid array) and includes an integrated heat spreader. The package has a 14-layer stack-up (5-4-5) to meet the various power-plane and signal requirements and has a total of 1248 pins, of which 343 are signal pins. Decoupling capacitors are mounted on the land side of the package as shown in Fig. 13(b). A PC running custom software is used to apply test vectors and observe results through the on-chip scan chain.

Fig. 13. (a) Package die-side. (b) Package land-side. (c) Evaluation board.

First silicon has been validated to be fully functional. Frequency versus power supply on a typical part is shown in Fig. 14. Silicon chip measurements at a case temperature of 80 °C demonstrate a chip maximum frequency of 1 GHz at 670 mV and 3.16 GHz at 950 mV, with frequency increasing to 5.1 GHz at 1.2 V and 5.67 GHz at 1.35 V. With all 80 tiles actively performing single-precision block-matrix operations, the chip achieves a peak performance of 0.32 TFLOPS (670 mV), 1.0 TFLOPS (950 mV), 1.63 TFLOPS (1.2 V), and 1.81 TFLOPS (1.35 V).

Fig. 14. Measured chip Fmax and peak performance.
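The peak numbers above follow directly from the tile count and per-tile throughput, as the short check below shows: 2 FLOPs per FPMAC per cycle, 2 FPMACs per tile, 80 tiles, evaluated at the measured Fmax points just quoted.

```python
# Arithmetic check of the quoted peak-performance points.

TILES, FPMACS_PER_TILE, FLOPS_PER_FPMAC = 80, 2, 2

def peak_tflops(freq_ghz: float) -> float:
    return TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC * freq_ghz / 1000.0

for vdd, f in [(0.67, 1.0), (0.95, 3.16), (1.2, 5.1), (1.35, 5.67)]:
    print(f"{vdd:.2f} V: {peak_tflops(f):.2f} TFLOPS peak")
# -> 0.32, 1.01, 1.63, 1.81 TFLOPS, matching the measured peak performance figures
```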
Several application kernels have been mapped to the design, and the performance is summarized in Table III. The table shows the single-precision floating-point operation count, the number of active tiles, and the average performance in TFLOPS for each application, reported as a percentage of the peak performance achievable with the design. In each case, task mapping was hand optimized and communication was overlapped with computation as much as possible to increase efficiency. The stencil code solves a steady-state 2-D heat diffusion equation with periodic boundary conditions on the left and right boundaries of a rectilinear grid, and prescribed temperatures on the top and bottom boundaries. For the stencil kernel, chip measurements indicate an average performance of 1.0 TFLOPS at 4.27 GHz and 1.07 V supply with 358K floating-point operations, achieving 73.3% of the peak performance. This result is particularly impressive because the execution is entirely overlapped with local loads and stores and with communication between neighboring tiles. The SGEMM matrix multiplication code operates on two 100 × 100 matrices with 2.63 million floating-point operations, corresponding to an average performance of 0.51 TFLOPS; it is important to note that the read bandwidth from local data memory limits the performance to half the peak rate. The spreadsheet kernel applies reductions to tables of data consisting of pairs of values and weights; for each table, the weighted sum of each row and each column is computed. A 64-point 2-D FFT which implements the Cooley–Tukey algorithm [17] using 64 tiles has also been successfully mapped to the design, with an average performance of 27.3 GFLOPS. It first computes 8-point FFTs in each tile, which in turn passes results to 63 other tiles for the 2-D FFT computation. The complex communication pattern results in high overhead and low efficiency.

TABLE III. APPLICATION PERFORMANCE MEASURED AT 1.07 V AND 4.27 GHz OPERATION

Fig. 15 shows the total chip power dissipation, with the active and leakage power components separated, as a function of frequency and power supply with the case temperature maintained at 80 °C. We report measured power for the stencil application kernel, since it is the most computationally intensive. The chip power consumption ranges from 15.6 W at 670 mV to 230 W at 1.35 V. With all 80 tiles actively executing stencil code, the chip achieves 1.0 TFLOPS of average performance at 4.27 GHz and 1.07 V supply with a total power dissipation of 97 W. The total power consumed increases to 230 W at 1.35 V and 5.67 GHz operation, delivering 1.33 TFLOPS of average performance.

Fig. 15. Stencil application performance measured at 1.07 V and 4.27 GHz operation.

Fig. 16 plots the measured energy efficiency in GFLOPS/W for the stencil application with power supply and frequency scaling. As expected, the chip energy efficiency increases with power supply reduction, from 5.8 GFLOPS/W at 1.35 V supply, to 10.5 GFLOPS/W at the 1.0 TFLOPS goal, to a maximum of 19.4 GFLOPS/W at 750 mV supply. Below 750 mV, the maximum operating frequency degrades faster than the power saved by lowering the tile supply voltage, resulting in overall performance reduction and a consequent drop in processor energy efficiency. The chip provides up to 394 GFLOPS of aggregate performance at 750 mV with a measured total power dissipation of just 20 W.

Fig. 16. Measured chip energy efficiency for stencil application.

Fig. 17 presents the estimated power breakdown at the tile and router levels, simulated at 4 GHz, 1.2 V supply, and 110 °C. The processing engine with the dual FPMACs, instruction and data memory, and the register file accounts for 61% of the total tile power [Fig. 17(a)]. The communication power is significant at 28% of the tile power, and the synchronous tile-level clock distribution accounts for 11% of the total. Fig. 17(b) shows a more detailed tile-to-tile communication power breakdown, which includes the router, mesochronous interfaces, and links. Clocking power is the largest component, accounting for 33% of the communication power. The input queues on both lanes and the data path circuits are the second major component, dissipating 22% of the communication power.

Fig. 17. Estimated (a) tile power profile and (b) communication power breakdown.
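The energy-efficiency points quoted above follow from the measured performance and power pairs, as the short check below shows (the small residual differences come from rounding of the quoted power numbers).

```python
# Worked check of GFLOPS/W = average performance / total chip power for the stencil kernel.

points = [  # (supply V, average TFLOPS, chip power W) -- values from the text
    (1.35, 1.33, 230.0),
    (1.07, 1.00, 97.0),
    (0.75, 0.394, 20.0),
]
for vdd, tflops, watts in points:
    print(f"{vdd:.2f} V: {1000 * tflops / watts:.1f} GFLOPS/W")
# -> ~5.8, ~10.3 and ~19.7 GFLOPS/W, consistent with the reported 5.8 / 10.5 / 19.4
```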
Fig. 18 shows the output differential and single-ended clock waveforms measured at the farthest clock buffer from the phase-locked loop (PLL) at a frequency of 5 GHz. Notice that the duty cycles of the clocks are close to 50%. Fig. 19 plots the global clock distribution power as a function of frequency and power supply; this is the switching power dissipated in the clock spines from the PLL to the opamp at the center of each of the 80 tiles. Measured silicon data at 80 °C show that this power is 80 mW at 0.8 V and 1.7 GHz, increasing by 10X to 800 mW at 1 V and 3.8 GHz. The global clock distribution power is 2 W at 1.2 V and 5.1 GHz and accounts for just 1.3% of the total chip power.

Fig. 18. Measured global clock distribution waveforms.

Fig. 19. Measured global clock distribution power.

Fig. 20 plots the chip leakage power as a percentage of the total power with all 80 processing engines and routers awake and with all the clocks disabled. Measurements show that the worst-case leakage power in active mode varies from a minimum of 9.6% to a maximum of 15.7% of the total power when measured over the power supply range of 670 mV to 1.35 V. In sleep mode, the nMOS sleep transistors are turned off, reducing chip leakage by 2X while preserving the logic state in all memory arrays. Fig. 21 shows the active and leakage power reduction due to a combination of the selective router port activation, clock gating, and sleep transistor techniques described in Section III. Measured at 1.2 V, 80 °C, and 5.1 GHz operation, the total network power per tile can be lowered from a maximum of 924 mW with all router ports active to 126 mW, resulting in a 7.3X reduction. The network leakage power per tile with all ports and the global clock buffers feeding the router disabled is 126 mW. This number includes the power dissipated in the router, MSINT, and the links.

Fig. 20. Measured chip leakage power as a percentage of total power versus Vcc. A 2X reduction is obtained by turning off sleep transistors.

Fig. 21. On-die network power reduction benefit.
V. CONCLUSION

In this paper, we have presented an 80-tile high-performance NoC architecture implemented in a 65-nm process technology. The prototype contains 160 lower-latency FPMAC cores and features a single-cycle accumulator architecture for high throughput. Each tile also contains a fast and compact router operating at core speed, and the 80 tiles are interconnected using a 2-D mesh topology providing a high bisection bandwidth of over 2 Terabits/s. The design uses a combination of micro-architecture, logic, circuits, and a 65-nm process to reach the target performance. Silicon operates over a wide voltage and frequency range and delivers teraFLOPS performance with high power efficiency. For the most computationally intensive application kernel, the chip achieves an average performance of 1.0 TFLOPS while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5 GFLOPS/W. Average performance scales to 1.33 TFLOPS at a maximum operational frequency of 5.67 GHz and 1.35 V supply. These results demonstrate the feasibility of high-performance and energy-efficient building blocks for peta-scale computing in the near future.

ACKNOWLEDGMENT

The authors would like to thank V. De, Prof. A. Alvandpour, D. Somasekhar, D. Jenkins, P. Aseron, J. Collias, B. Nefcy, P. Iyer, S. Venkataraman, S. Saha, M. Haycock, J. Schutz, and J. Rattner for help, encouragement, and support; T. Mattson, R. Wijngaart, and M. Frumkin from the SSG and ARL teams at Intel for assistance with mapping the kernels to the design; the LTD and ATD teams for PLL and package design and assembly; the entire mask design team for chip layout; and the reviewers for their useful remarks.

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
[2] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. 38th Design Automation Conf., Jun. 2001, pp. 681–689.
[3] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, Mar.–Apr. 2002.
[4] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," in Proc. 30th Annu. Int. Symp. Computer Architecture, 2003, pp. 422–433.
[5] Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, "An asynchronous array of simple processors for DSP applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2006, pp. 428–429.
[6] H. Wang, L. Peh, and S. Malik, "Power-driven design of router microarchitectures in on-chip networks," in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), 2003, pp. 105–116.
[7] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 98–99.
[8] S. Vangal, Y. Hoskote, N. Borkar, and A. Alvandpour, "A 6.2-GFLOPS floating-point multiply-accumulator with conditional normalization," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2314–2323, Oct. 2006.
[9] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std. 754-1985, IEEE Standards Board, New York, 1985.
[10] S. Vangal, N. Borkar, and A. Alvandpour, "A six-port 57 GB/s double-pumped non-blocking router core," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2005, pp. 268–269.
[11] H. Wilson and M. Haycock, "A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1954–1963, Dec. 2001.
[12] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang, and M. Bohr, "A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 μm² SRAM cell," in IEDM Tech. Dig., Dec. 2004, pp. 657–660.
[13] F. Klass, "Semi-dynamic and dynamic flip-flops with embedded logic," in Symp. VLSI Circuits Dig. Tech. Papers, 1998, pp. 108–109.
[14] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, 2001, pp. 147–151.
[15] J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar, and V. De, "Dynamic sleep transistor and body bias for active leakage power control of microprocessors," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1838–1845, Nov. 2003.
[16] M. Khellah, D. Somasekhar, Y. Ye, N. Kim, J. Howard, G. Ruhl, M. Sunna, J. Tschanz, N. Borkar, F. Hamzaoglu, G. Pandya, A. Farhang, K. Zhang, and V. De, "A 256-Kb dual-Vcc SRAM building block in 65-nm CMOS process with actively clamped sleep transistor," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 233–242, Jan. 2007.
[17] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.
[18] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour, "A 5.1 GHz 0.34 mm² router for network-on-chip applications," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 42–43.
Sriram R. Vangal (S'90–M'98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, and the Ph.D. degree from Linköping University, Sweden, all in electrical engineering. His Ph.D. research focused on energy-efficient network-on-chip (NoC) designs. With Intel since 1995, he is currently a Senior Research Scientist at the Microprocessor Technology Labs, Hillsboro, OR. He was the technical lead for the advanced prototype team that designed the industry's first single-chip Teraflops processor. His research interests are in the area of low-power high-performance circuits, power-aware computing, and NoC architectures. He has published 16 papers and has 16 issued patents with 7 pending in these areas.

Jason Howard received the M.S.E.E. degree from Brigham Young University, Provo, UT, in 2000. He was an Intern with Intel Corporation during the summers of 1998 and 1999, working on the Pentium 4 microprocessor. In 2000, he formally joined Intel, working in the Oregon Rotation Engineers Program. After two successful rotations through NCG and CRL, he officially joined Intel Laboratories, Hillsboro, OR, in 2001. He is currently working for the CRL Prototype Design team. He has co-authored several papers and has patents pending.

Gregory Ruhl (M'07) received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1998 and 1999, respectively. He joined Intel Corporation, Hillsboro, OR, in 1999 as part of the Rotation Engineering Program, where he worked on the PCI-X I/O switch, Gigabit Ethernet validation, and individual circuit research projects. After completing the REP program, he joined Intel's Circuits Research Lab, where he has been working on design, research, and validation on a variety of topics ranging from SRAMs and signaling to terascale computing.

Saurabh Dighe (M'05) received the B.E. degree in electronics engineering from the University of Mumbai, India, in 2001, and the M.S. degree in computer engineering from the University of Minnesota, Minneapolis, in 2003. His work focused on computer architecture and the design, modeling, and simulation of digital systems. He was with Intel Corporation, Santa Clara, CA, working on front-end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently, he is with Intel Corporation's Circuit Research Labs, Microprocessor Technology Labs, Hillsboro, OR, and is a member of the prototype design team involved in the definition, implementation, and validation of future terascale computing technologies.

Howard Wilson was born in Chicago, IL, in 1957. He received the B.S. degree in electrical engineering from Southern Illinois University, Carbondale, in 1979. From 1979 to 1984 he worked at Rockwell-Collins, Cedar Rapids, IA, where he designed navigation and electronic flight display systems. From 1984 to 1991, he worked at National Semiconductor, Santa Clara, CA, designing telecom components for ISDN. With Intel since 1992, he is currently a member of the Circuits Research Laboratory, Hillsboro, OR, engaged in a variety of advanced prototype design activities.

James Tschanz (M'97) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1997 and 1999, respectively. Since 1999, he has been a Circuits Researcher with the Intel Circuit Research Lab, Hillsboro, OR. His research interests include low-power digital circuits, design techniques, and methods for tolerating parameter variations. He is also an Adjunct Faculty Member at the Oregon Graduate Institute, Beaverton, OR, where he teaches digital VLSI design.

David Finan received the A.S. degree in electronic engineering technology from Portland Community College, Portland, OR, in 1989. He joined Intel Corporation, Hillsboro, OR, working on the iWarp project in 1988. He started working on the Intel i486DX2 project in 1990, the Intel Teraflop project in 1993, for Intel Design Laboratories in 1995, for Intel Advanced Methodology and Engineering in 1996, on the Intel Timna project in 1998, and for Intel Laboratories in 2000.

Arvind Singh (M'05) received the B.Tech. degree in electronics and communication engineering from the Institute of Engineering and Technology, Lucknow, India, in 2000, and the M.Tech. degree in VLSI design from the Indian Institute of Technology, Delhi, India, in 2001. He holds the University Gold Medal for his B.Tech. degree. He is a Research Scientist at the Microprocessor Technology Labs, Intel Bangalore, India, working on scalable on-die interconnect fabric for terascale research and prototyping. His research interests include network-on-chip technologies and energy-efficient building blocks. Prior to Intel, he worked on high-performance network search engines and low-power SRAMs at Cypress Semiconductors. Mr. Singh is a member of the IEEE and TiE.

Tiju Jacob received the B.Tech. degree in electronics and communication engineering from NIT, Calicut, India, in 2000, and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, India, in 2003. He joined Intel Corporation, Bangalore, India, in February 2003 and is currently a member of the Circuit Research Labs. His areas of interest include low-power high-performance circuits, many-core processors, and advanced prototyping.

Shailendra Jain received the B.E. degree in electronics engineering from the Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in electrical engineering from IIT Madras, India, in 2001. Since 2004, he has been with the Intel Bangalore Design Lab, India. His work includes research and advanced prototyping in the areas of low-power high-performance digital circuits and physical design of multi-million-transistor chips. He has coauthored two papers in these areas.

Vasantha Erraguntla (M'92) received the Bachelors degree in electrical engineering from Osmania University, India, and the Masters degree in computer engineering from the University of Louisiana, Lafayette. She joined Intel in Hillsboro, OR, in 1991 and worked on the high-speed router technology for the Intel Teraflop machine (ASCI Red). For over 10 years, she was engaged in a variety of advanced prototype design activities at Intel Laboratories, implementing and validating research ideas in the areas of high-performance and low-power circuits and high-speed signaling. She relocated to India in June 2004 to start up the Bangalore Design Lab to help facilitate circuit research and prototype development. She has co-authored seven papers and has two patents issued and four pending.

Clark Roberts received the A.S.E.E.T. degree from Mt. Hood Community College, Gresham, OR, in 1985. While pursuing his education, he held several technical positions at Tektronix Inc., Beaverton, OR, from 1978 to 1992. In 1992, he joined Intel Corporation's Supercomputer Systems Division, where he worked on the research and design of clock distribution systems for massively parallel supercomputers. He worked on the system clock design for the ASCI Red TeraFlop Supercomputer (Intel, DOE, and Sandia). From 1996 to 2000, he worked as a Signal Integrity Engineer in Intel Corporation's Microprocessor Division, focusing on clock distribution for Intel microprocessor package and motherboard designs. In 2001, he joined the circuit research area of Intel Labs. He has been working on physical prototype design and measurement systems for high-speed I/O and terascale microprocessor research.

Yatin Hoskote (M'96) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, and the M.S. and Ph.D. degrees in computer engineering from the University of Texas at Austin. He joined Intel in 1995 as a member of Strategic CAD Labs doing research in verification technologies. He is currently a Principal Engineer in the Advanced Prototype design team in Intel's Microprocessor Technology Lab, working on next-generation network-on-chip technologies. Dr. Hoskote received the Best Paper Award at the Design Automation Conference in 1999 and an Intel Achievement Award in 2006. He is Chair of the program committee for the 2007 High Level Design Validation and Test Workshop and a guest editor for the IEEE Design & Test Special Issue on Multicore Interconnects.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985. He joined Intel Corporation in Portland, OR, in 1986, where he worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Circuit Research Laboratory, developing novel technologies in the high-performance low-power circuit areas and applying them toward terascale computing research.

Shekhar Borkar (M'97) was born in Mumbai, India. He received the B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981. He is an Intel Fellow and Director of the Circuit Research Laboratories at Intel Corporation, Hillsboro, OR. He joined Intel in 1981, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. He is an adjunct member of the faculty of the Oregon Graduate Institute, Beaverton. He has published 10 articles and holds 11 patents.