FPGA implementation and evaluation of a simple processor for multi-scalar/vector/matrix instructions

Author

Soliman, Mostafa I. ; Elsayed, Elsayed A.

Author_Institution

Comput. Sci. & Inf. Dept., Taibah Univ., Al-Madinah Al-Munawwarah, Saudi Arabia

fYear

2014

fDate

19-20 April 2014

Firstpage

1

Lastpage

7

Abstract

This paper presents the FPGA implementation of a simple processor architecture for accelerating data-parallel applications. The proposed processor, called SuperSMP, can execute multi-scalar, vector, and matrix instructions on parallel execution datapaths. Four 32-bit instructions (4×32 bits) are fetched from the instruction cache. The fetched instructions are decoded and their dependencies are checked. Up to four independent scalar instructions can be issued in order to the parallel execution units. Vector/matrix instructions, however, iterate the issuing of four vector/matrix operations without dependency checking. On its four parallel execution units, SuperSMP can perform addition, subtraction, multiplication, division, and shifting on scalar/vector/matrix data. Four contiguous 32-bit vector/matrix elements (4×32 bits) can be loaded from the L2 cache into the matrix register file, or stored from the matrix register file to the L2 cache, per clock cycle. Finally, up to 4×32-bit results or loaded data can be written into the scalar/matrix register files. The FPGA implementation of SuperSMP requires 14,032 slices on a Xilinx Virtex-5 (XC5VLX110-3FF1153). The number of LUT flip-flop pairs is 49,398, of which 17,166 have an unused flip-flop, 10,267 have an unused LUT, and 21,965 are fully used. The complexity of SuperSMP is about 3.5 times that of the baseline scalar processor; its performance, however, ranges from 4.3 to 18.2 times higher than that of the baseline scalar processor.
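
The in-order issue of up to four independent scalar instructions described in the abstract can be illustrated with a minimal sketch. This is not the SuperSMP issue logic: the `Instr` record, the `issue_group` function, the three-register instruction format, and the rule that issue stops at the first instruction that reads or writes a register written earlier in the same fetch group are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    """Hypothetical three-register scalar instruction (illustrative only)."""
    op: str
    dest: int
    src1: int
    src2: int


ISSUE_WIDTH = 4  # four parallel execution units, as stated in the abstract


def issue_group(fetched: List[Instr]) -> List[Instr]:
    """Return the in-order prefix of a fetch group that can issue together.

    Assumption: issue stops at the first instruction that depends on an
    earlier instruction of the same group (read-after-write or
    write-after-write on a register).
    """
    issued: List[Instr] = []
    written = set()  # destination registers of instructions issued so far
    for instr in fetched[:ISSUE_WIDTH]:
        raw = instr.src1 in written or instr.src2 in written
        waw = instr.dest in written
        if raw or waw:
            break  # in-order issue: nothing after a stalled instruction issues
        issued.append(instr)
        written.add(instr.dest)
    return issued


group = [
    Instr("add", 1, 2, 3),
    Instr("sub", 4, 5, 6),
    Instr("mul", 7, 1, 4),   # reads r1 and r4, written by the two above
    Instr("add", 8, 9, 10),
]
issued_now = issue_group(group)
```

In this example only the first two instructions issue together; under the stated assumption, the `mul` and the instruction behind it wait for a later cycle.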

Keywords

application specific integrated circuits; field programmable gate arrays; flip-flops; integrated logic circuits; table lookup; FPGA implementation; LUT flip-flop pairs; SuperSMP; XC5VLX110-3FF1153; Xilinx Virtex-5; addition; baseline scalar processor; data-parallel applications; division; matrix register file; multiplication; multiscalar-vector-matrix instructions; scalar-vector-matrix data shifting; simple processor architecture; subtraction; Field programmable gate arrays; Frequency synthesizers; Kernel; Loading; Parallel processing; Table lookup; Vectors; FPGA; data-parallel applications; performance evaluation; vector/matrix processing

fLanguage

English

Publisher

IEEE

Conference_Title

Engineering and Technology (ICET), 2014 International Conference on

Conference_Location

Cairo

Type

conf

DOI

10.1109/ICEngTechnol.2014.7016776

Filename

7016776