Title :
An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS
Author :
Vangal, S.R. ; Howard, J. ; Ruhl, G. ; Dighe, S. ; Wilson, H. ; Tschanz, J. ; Finan, D. ; Singh, A. ; Jacob, T. ; Jain, S. ; Erraguntla, V. ; Roberts, C. ; Hoskote, Y. ; Borkar, N. ; Borkar, S.
Author_Institution :
Intel Corp., Hillsboro
Abstract :
This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8×10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
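The headline figures in the abstract can be sanity-checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes each FPMAC retires one multiply and one add (2 FLOPs) per cycle, which the abstract does not state explicitly, and it treats "over 1.0 TFLOPS" as a lower bound of 1000 GFLOPS when estimating energy efficiency. All names are hypothetical.

```python
# Back-of-envelope check of the abstract's figures (illustrative sketch).
TILES = 80                      # from the abstract
FPMACS_PER_TILE = 2             # from the abstract
FLOPS_PER_FPMAC_PER_CYCLE = 2   # ASSUMPTION: one multiply + one add per cycle


def peak_gflops(freq_ghz: float) -> float:
    """Theoretical peak throughput in GFLOPS at the given clock (GHz)."""
    return TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC_PER_CYCLE * freq_ghz


design_peak = peak_gflops(4.0)      # at the 4 GHz design target
measured_peak = peak_gflops(4.27)   # at the reported 4.27 GHz operating point

# Lower bound on efficiency: "over 1.0 TFLOPS" at 97 W (ASSUMPTION: 1000 GFLOPS floor).
efficiency_floor = 1000.0 / 97.0    # GFLOPS per watt

print(f"peak at 4.00 GHz: {design_peak:.1f} GFLOPS")
print(f"peak at 4.27 GHz: {measured_peak:.1f} GFLOPS")
print(f"efficiency floor: {efficiency_floor:.1f} GFLOPS/W")
```

Under these assumptions the theoretical peak exceeds 1 TFLOPS at both clock rates, which is consistent with the abstract's claim of over 1.0 TFLOPS sustained on benchmarks at 4.27 GHz.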
Keywords :
CMOS digital integrated circuits; field effect MMIC; network-on-chip; 80-tile sub-100-W TeraFLOPS processor; CMOS; body-bias techniques; dynamic sleep transistors; fine-grained clock gating; floating-point cores; frequency 4 GHz; frequency 4.27 GHz; integrated network-on-chip architecture; mesochronous clocking; on-chip 2D mesh network; packet-switched routers; pipelined single-precision floating-point multiply accumulators; power 97 W; single-cycle accumulation loop; size 65 nm; voltage 1.07 V; Bandwidth; CMOS process; Integrated circuit interconnections; Jacobian matrices; Microprocessors; Network-on-a-chip; System-on-a-chip; Throughput; MAC; crossbar router and network-on-chip (NoC); floating-point unit; interconnection; leakage reduction; multiply-accumulate
Journal_Title :
Solid-State Circuits, IEEE Journal of
DOI :
10.1109/JSSC.2007.910957