Author_Institution :
Comput. Sci. & Inf. Dept., Taibah Univ., Al-Madinah Al-Munawwarah, Saudi Arabia
Abstract :
This paper presents the FPGA implementation of a simple processor architecture for accelerating data-parallel applications. The proposed processor, called SuperSMP, can execute multi-scalar, vector, and matrix instructions on parallel execution datapaths. Four 32-bit instructions (4×32 bits) are fetched from the instruction cache. The fetched instructions are decoded and their dependencies are checked. Up to four independent scalar instructions can be issued in order to the parallel execution units. A vector/matrix instruction, in contrast, repeatedly issues four vector/matrix operations at a time without dependency checking. On its four parallel execution units, SuperSMP can perform addition, subtraction, multiplication, division, and shifting on scalar/vector/matrix data. Four contiguous 32-bit vector/matrix elements can be loaded from the L2 cache into the matrix register file, or stored from the matrix register file to the L2 cache, per clock cycle. Finally, up to 4×32 bits of results or loaded data can be written into the scalar/matrix register files. The FPGA implementation of SuperSMP requires 14,032 slices on a Xilinx Virtex-5 (XC5VLX110-3FF1153). The number of LUT flip-flop pairs is 49,398, of which 17,166 have an unused flip-flop, 10,267 have an unused LUT, and 21,965 are fully used. The complexity of SuperSMP is about 3.5 times that of the baseline scalar processor, while its performance is 4.3 to 18.2 times higher than that of the baseline scalar processor.
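The in-order issue of up to four independent scalar instructions described above can be illustrated with a minimal sketch. It is not the authors' hardware logic: the `Instr` record, the register names, and the rule of stopping at the first instruction that reads or rewrites a register written earlier in the same fetch group are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ISSUE_WIDTH = 4  # four 32-bit instructions per fetch group, as in the abstract


@dataclass
class Instr:
    """Hypothetical decoded scalar instruction (illustrative only)."""
    op: str
    dest: Optional[str]      # destination register, or None
    srcs: Tuple[str, ...]    # source registers


def issue_group(fetched: List[Instr]) -> List[Instr]:
    """Return the leading instructions of a fetch group that can issue together.

    Assumed policy: issue is in order, so the group is cut at the first
    instruction that depends on an earlier instruction in the same group.
    """
    issued: List[Instr] = []
    written = set()  # registers written by instructions already issued
    for ins in fetched[:ISSUE_WIDTH]:
        reads_pending = any(s in written for s in ins.srcs)  # read-after-write
        rewrites_pending = ins.dest in written                # write-after-write
        if reads_pending or rewrites_pending:
            break  # in-order issue: nothing behind a stalled instruction issues
        issued.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
    return issued


group = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r5", "r6")),
    Instr("mul", "r7", ("r1", "r4")),  # needs r1 and r4 from this same group
    Instr("add", "r8", ("r9", "r9")),
]
issued_ops = [i.op for i in issue_group(group)]
```

In this example only `add` and `sub` issue together; `mul` waits because it reads registers written in the same group, and the final `add` waits behind it because issue is in order.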
Keywords :
application specific integrated circuits; field programmable gate arrays; flip-flops; integrated logic circuits; table lookup; FPGA implementation; LUT flip-flop pairs; SuperSMP; XC5VLX110-3FF1153; Xilinx Virtex-5; addition; baseline scalar processor; data-parallel applications; division; matrix register file; multiplication; multiscalar-vector-matrix instructions; scalar-vector-matrix data shifting; simple processor architecture; subtraction; Field programmable gate arrays; Frequency synthesizers; Kernel; Loading; Parallel processing; Table lookup; Vectors; FPGA; data-parallel applications; performance evaluation; vector/matrix processing;