Author_Institution :
Electr. Eng. Dept., South Valley Univ., Aswan, Egypt
Abstract :
The discrete cosine transform (DCT) is one of the major operations in various image/video compression standards. This paper implements the DCT and its inverse (IDCT) on our proposed Mat-Core processor using scalar/vector/matrix instruction sets. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. The data computation unit is organized in parallel lanes, which can execute scalar-vector, vector-vector, scalar-matrix, vector-matrix, and matrix-matrix instructions. To show the scalability of the Mat-Core architecture, the performance of DCT and IDCT is evaluated on Mat-Core with different numbers of parallel lanes (one, four, and eight). A cycle-accurate model of the Mat-Core processor is implemented in SystemC (a system-level modeling language). Our results show performance of 1.5, 5, 6.4, and 14.4 FLOPs/cycle on Mat-Core with a single lane and 8-element vector registers, four lanes and 4×4 matrix registers, four lanes and 8×4 matrix registers, and eight lanes and 8×8 matrix registers, respectively. The maximum performance of the Mat-Core processor on DCT and IDCT represents 90% of the ideal value. Moreover, increasing the number of parallel lanes from one to four and then to eight speeds up the execution of DCT and IDCT by factors of 4.2 and 9.5, respectively, which indicates the scalability of the Mat-Core architecture.
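To make the workload concrete, the sketch below shows one common way to express a 2-D DCT and IDCT as two matrix-matrix products, Y = C·X·Cᵀ and X = Cᵀ·Y·C, which is the kind of operation a matrix-matrix instruction would target. It is a plain Python illustration only: the abstract does not state which DCT algorithm or normalization the paper uses, so the orthonormal DCT-II basis and the helper names (`dct_matrix`, `matmul`, `transpose`, `dct2`, `idct2`) are assumptions, and nothing here models Mat-Core instructions, lanes, or timing.

```python
import math


def dct_matrix(n):
    """Build the n x n orthonormal DCT-II basis matrix C (assumed normalization)."""
    rows = []
    for k in range(n):
        # Row 0 uses a smaller scale factor so that C is orthogonal (C * C^T = I).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    """Naive dense matrix product a * b on lists of lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def dct2(block):
    """2-D DCT of a square block: Y = C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """2-D inverse DCT of a square block: X = C^T * Y * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


if __name__ == "__main__":
    # Hypothetical 8x8 input block; any square block of numbers works.
    block = [[float((3 * r + 5 * c) % 17) for c in range(8)] for r in range(8)]
    restored = idct2(dct2(block))
    max_err = max(abs(block[r][c] - restored[r][c])
                  for r in range(8) for c in range(8))
    print("max round-trip error:", max_err)
```

Because C is orthogonal under this normalization, applying `idct2` to the output of `dct2` recovers the input block up to floating-point rounding.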
Keywords :
discrete cosine transforms; hardware description languages; instruction sets; inverse transforms; memory architecture; optimising compilers; parallel processing; performance evaluation; DCT implementation; Mat-Core architecture; Mat-Core processor; SystemC; address generation; data computation; data queues; discrete cosine transform; general-purpose scalar processor; image-video compression standards; matrix processing unit; memory latency; registers; scalable matrix processor; scalar-vector-matrix instruction sets; system level modeling language; Clocks; Computer architecture; Matrices; Pipelines; DCT/IDCT; SystemC implementation; high performance computing; scalable architecture; vector/matrix processing