Title : 
Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
         
        
            Author : 
Matsumoto, Kazuya ; Nakasato, Naohito ; Sedukhin, Stanislav G.
         
        
            Author_Institution : 
Grad. Sch. of Comput. Sci. & Eng., Univ. of Aizu, Aizu-Wakamatsu City, Aizu-Wakamatsu, Japan
         
        
            Abstract : 
This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The generated kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70% of the peak), respectively.
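The abstract names a block-major layout without giving its details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes square blocks, row-major ordering both across blocks and inside each block, and matrix dimensions that are multiples of the block size. The names `to_block_major`, `block_major_index`, and the block size `bs` are hypothetical and chosen for illustration.

```python
def to_block_major(a, rows, cols, bs):
    """Reorder a row-major matrix (flat list) into block-major order.

    Assumption for this sketch: bs x bs blocks are stored one after
    another (block rows first), and the elements inside each block are
    stored row by row. rows and cols must be multiples of bs.
    """
    if rows % bs or cols % bs:
        raise ValueError("rows and cols must be multiples of the block size")
    out = []
    for bi in range(0, rows, bs):          # top row of each block
        for bj in range(0, cols, bs):      # left column of each block
            for i in range(bi, bi + bs):   # rows inside the block
                start = i * cols + bj
                out.extend(a[start:start + bs])
    return out


def block_major_index(i, j, cols, bs):
    """Flat offset of element (i, j) in the block-major array."""
    blocks_per_row = cols // bs
    block_id = (i // bs) * blocks_per_row + (j // bs)
    return block_id * bs * bs + (i % bs) * bs + (j % bs)


rows, cols, bs = 4, 4, 2
a = list(range(rows * cols))               # row-major 4 x 4 matrix 0..15
b = to_block_major(a, rows, cols, bs)
print(b)  # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

With this ordering, every element of a block sits in one contiguous run of memory, so a kernel that loads a whole block reads consecutive addresses instead of striding across matrix rows.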
         
        
            Keywords : 
graphics processing units; matrix algebra; program compilers; GPU global memory; OpenCL; Radeon HD 7970; SGEMM; code generator; code generator for fast general matrix multiply; fast matrix multiplication; matrix data; single-precision GEMM; Bandwidth; Generators; Graphics processing units; High definition video; Kernel; Layout; Search engines; GPU; OpenCL; auto-tuning; matrix multiplication;
         
        
        
        
            Conference_Title : 
Embedded Multicore SoCs (MCSoC), 2012 IEEE 6th International Symposium on
         
        
            Conference_Location : 
Aizu-Wakamatsu
         
        
            Print_ISBN : 
978-1-4673-2535-6
         
        
            Electronic_ISBN : 
978-0-7695-4800-5
         
        
        
            DOI : 
10.1109/MCSoC.2012.30