DocumentCode
2255451
Title
Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
Author
Matsumoto, Kazuya ; Nakasato, Naohito ; Sedukhin, Stanislav G.
Author_Institution
Grad. Sch. of Comput. Sci. & Eng., Univ. of Aizu, Aizu-Wakamatsu City, Aizu-Wakamatsu, Japan
fYear
2012
fDate
20-22 Sept. 2012
Firstpage
198
Lastpage
204
Abstract
This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.
Keywords
graphics processing units; matrix algebra; program compilers; GPU global memory; OpenCL; Radeon HD 7970; SGEMM; code generator; code generator for fast general matrix multiply; fast matrix multiplication; matrix data; single-precision GEMM; Bandwidth; Generators; Graphics processing units; High definition video; Kernel; Layout; Search engines; GPU; OpenCL; auto-tuning; matrix multiplication;
fLanguage
English
Publisher
IEEE
Conference_Titel
Embedded Multicore SoCs (MCSoC), 2012 IEEE 6th International Symposium on
Conference_Location
Aizu-Wakamatsu
Print_ISBN
978-1-4673-2535-6
Electronic_ISBN
978-0-7695-4800-5
Type
conf
DOI
10.1109/MCSoC.2012.30
Filename
6354699