Title :
A 280mV-to-1.1V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22nm CMOS
Author :
Hsu, Steven ; Agarwal, Amit ; Anders, Mark ; Mathew, Sanu ; Kaul, Himanshu ; Sheikh, Farhana ; Krishnamurthy, Ram
Author_Institution :
Intel, Hillsboro, OR, USA
Abstract :
Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
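As an illustration only, the 2-dimensional shuffle described in the abstract can be modeled in software as a vertical shuffle (each byte lane read from a different register-file entry) followed by a horizontal any-to-any byte permute. The sketch below is a behavioral model, not the circuit or the algorithm reported in the paper: the function names, the skew-and-gather transpose schedule, and the test data are all assumptions chosen to show how a 4×4 transpose of 64b elements could be composed from those two primitives.

```python
LANES = 32  # one 256b entry modeled as 32 byte lanes


def horizontal_permute(row, idx):
    """Any-to-any byte permute: output byte i takes input byte idx[i]."""
    assert len(row) == LANES and len(idx) == LANES
    return bytes(row[j] for j in idx)


def vertical_shuffle(regfile, entry_sel):
    """Form one 256b word whose byte lane i is read from entry entry_sel[i]."""
    assert len(entry_sel) == LANES
    return bytes(regfile[entry_sel[i]][i] for i in range(LANES))


def rotate_idx(shift, elem_bytes=8):
    """Byte indices that move element j to element lane (j + shift) mod n."""
    n = LANES // elem_bytes
    return [((l - shift) % n) * elem_bytes + b
            for l in range(n) for b in range(elem_bytes)]


def transpose_4x4(rows, elem_bytes=8):
    """Transpose a 4x4 matrix of 64b elements stored one row per entry.

    Hypothetical schedule (an assumption, not taken from the paper):
    1. skew: rotate row i horizontally by i element lanes;
    2. gather: vertical shuffle picks one entry per element lane;
    3. unskew: rotate the gathered word back horizontally.
    """
    n = LANES // elem_bytes
    skew = [horizontal_permute(rows[i], rotate_idx(i, elem_bytes))
            for i in range(n)]
    out = []
    for r in range(n):
        # Element lane l of the gathered word comes from entry (l - r) mod n.
        sel = [(l - r) % n for l in range(n) for _ in range(elem_bytes)]
        gathered = vertical_shuffle(skew, sel)
        out.append(horizontal_permute(gathered, rotate_idx(-r, elem_bytes)))
    return out


# Usage: element (i, j) is filled with the byte value 4*i + j.
matrix = [b"".join(bytes([4 * i + j]) * 8 for j in range(4)) for i in range(4)]
transposed = transpose_4x4(matrix)
```

In this model each output row costs one vertical read plus one horizontal permute after the initial skew, which is the kind of combined row/column data movement the abstract attributes to the 2-dimensional shuffle.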
Keywords :
CMOS integrated circuits; parallel processing; reconfigurable architectures; 2D shuffle; CMOS; clockless static reads; dual-ended transmission gate writes; dynamic operating range; energy savings; energy-efficient SIMD permutation operation; frequency 1.8 GHz; frequency 10 MHz; frequency 2.9 GHz; graphics; high-throughput data rearrangement operation; horizontal shuffle; interleaved folded byte-wise multiplexer layout; matrix transpose algorithm; microprocessor vector datapath utilization; min-sized transistor variation; multimedia; nominal register file performance; peak energy efficiency; peak throughput; power 106 mW; power 19 μW; power 69 mW; read/write operation; reconfigurable SIMD vector permutation engine; robust register file functionality; scalable permute crossbar performance; signal processing workloads; size 22 nm; stacked min-delay buffers; subthreshold operation; temperature 50 °C; ultralow voltage split-output level shifters; vector flip-flops; vertical shuffle; voltage 0.9 V; voltage 150 mV; voltage 240 mV; voltage 260 mV; voltage 280 mV to 1.1 V; Energy efficiency; Energy measurement; Engines; Frequency measurement; Registers; Vectors; Voltage measurement;
Conference_Title :
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International
Conference_Location :
San Francisco, CA
Print_ISBN :
978-1-4673-0376-7
DOI :
10.1109/ISSCC.2012.6176966