Title :
A 280 mV-to-1.1 V 256b Reconfigurable SIMD Vector Permutation Engine With 2-Dimensional Shuffle in 22 nm Tri-Gate CMOS
Author :
Hsu, S.K. ; Agarwal, Abhishek ; Anders, Mark A. ; Mathew, Sanu K. ; Kaul, Himanshu ; Sheikh, Farhana ; Krishnamurthy, Ram K.
Author_Institution :
Circuit Res. Lab., Intel Corp., Hillsboro, OR, USA
Abstract :
An ultra-low voltage reconfigurable 4-way to 32-way SIMD vector permutation engine is fabricated in 22 nm tri-gate bulk CMOS, consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clock-less static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250 mV across PVT variations with a wide dynamic operating range of 280 mV-1.1 V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates, and ultra-low voltage split-output (ULVS) level shifters improving logic VMIN by 150 mV, while enabling peak energy efficiency of 585 GOPS/W measured at 260 mV, 50 °C. The permutation engine achieves: (i) nominal register file performance of 1.8 GHz, 106 mW measured at 0.9 V, 50 °C, (ii) robust register file functionality measured down to 280 mV with peak energy efficiency of 154 GOPS/W, (iii) scalable permute crossbar performance of 2.9 GHz, 69 mW measured at 1.1 V, 50 °C with sub-threshold operation at 240 mV, 10 MHz consuming 19 μW, and (iv) a 64b 4 × 4 matrix transpose algorithm and AoS to SoA conversion with 40%-53% energy savings and 25%-42% improved peak throughput measured at 1.8 GHz, 0.9 V.
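As a reading aid, the byte-wise any-to-any permute and the 64b 4 × 4 transpose (AoS to SoA) named in the abstract can be modelled functionally in a few lines. This is only a software sketch under stated assumptions: the function names `permute_bytes` and `transpose_4x4`, the selector encoding, and the little-endian lane layout are hypothetical and do not come from the paper, and the model makes no claim about how the hardware sequences its vertical and horizontal shuffles.

```python
import struct

VEC_BYTES = 32   # one 256b register-file entry
ELEM_BYTES = 8   # one 64b element
LANES = VEC_BYTES // ELEM_BYTES  # 4 elements per entry


def permute_bytes(vec: bytes, sel: list[int]) -> bytes:
    """Any-to-any byte permute: output byte i is input byte sel[i].

    Selectors may repeat, so one source byte can be broadcast
    to several output positions (hypothetical encoding).
    """
    if len(vec) != VEC_BYTES or len(sel) != VEC_BYTES:
        raise ValueError("expected a 32-byte vector and 32 selectors")
    return bytes(vec[s] for s in sel)


def transpose_4x4(rows: list[bytes]) -> list[bytes]:
    """Transpose a 4 x 4 matrix of 64b elements held in four 256b entries.

    Output entry r, lane c takes input entry c, lane r. Reading across
    entries corresponds to a vertical shuffle; placing elements within
    an entry corresponds to a horizontal shuffle.
    """
    if len(rows) != LANES or any(len(r) != VEC_BYTES for r in rows):
        raise ValueError("expected four 32-byte entries")
    out = []
    for r in range(LANES):
        entry = bytearray()
        for c in range(LANES):
            entry += rows[c][r * ELEM_BYTES:(r + 1) * ELEM_BYTES]
        out.append(bytes(entry))
    return out


# AoS to SoA: four structs, each with four 64b fields (x, y, z, w),
# become four arrays holding all x, all y, all z, and all w values.
aos = [struct.pack("<4Q", 10 * i, 10 * i + 1, 10 * i + 2, 10 * i + 3)
       for i in range(LANES)]
soa = transpose_4x4(aos)
```

In this model, `struct.unpack("<4Q", soa[1])` gathers the second field of every struct, which is the AoS to SoA conversion expressed as a plain matrix transpose.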
Keywords :
CMOS integrated circuits; flip-flops; low-power electronics; parallel processing; 2-dimensional shuffle; DETG; P/N dual-ended transmission gate; ULVS level shifter; clock-less static reads; frequency 1.8 GHz; frequency 10 MHz; frequency 2.9 GHz; interleaved folded byte-wise multiplexer layout; peak energy efficiency; power 106 mW; power 19 μW; power 69 mW; register file; scalable permute crossbar; shared gates; size 22 nm; stacked min-delay buffer; temperature 50 °C; tri-gate bulk CMOS; ultra-low voltage reconfigurable SIMD vector permutation engine; ultra-low voltage split-output level shifter; vector flip-flops; voltage 0.9 V; voltage 240 mV; voltage 260 mV; voltage 280 mV to 1.1 V; Energy measurement; Engines; Logic gates; Multiplexing; Registers; Vectors; Voltage measurement; VMIN; Single instruction multiple data (SIMD); crossbar; flip-flop; level shifter; near-threshold voltage (NTV); permutation; register file; ultra-low voltage; vector processing
Journal_Title :
Solid-State Circuits, IEEE Journal of
DOI :
10.1109/JSSC.2012.2222811