• DocumentCode
    2255451
  • Title
    Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
  • Author
    Matsumoto, Kazuya ; Nakasato, Naohito ; Sedukhin, Stanislav G.
  • Author_Institution
    Grad. Sch. of Comput. Sci. & Eng., Univ. of Aizu, Aizu-Wakamatsu City, Japan
  • fYear
    2012
  • fDate
    20-22 Sept. 2012
  • Firstpage
    198
  • Lastpage
    204
  • Abstract
    This paper presents results from an implementation of a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve performance of up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.
  • Keywords
    graphics processing units; matrix algebra; program compilers; GPU global memory; OpenCL; Radeon HD 7970; SGEMM; code generator; code generator for fast general matrix multiply; fast matrix multiplication; matrix data; single-precision GEMM; Bandwidth; Generators; Graphics processing units; High definition video; Kernel; Layout; Search engines; GPU; OpenCL; auto-tuning; matrix multiplication;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Embedded Multicore Socs (MCSoC), 2012 IEEE 6th International Symposium on
  • Conference_Location
    Aizu-Wakamatsu
  • Print_ISBN
    978-1-4673-2535-6
  • Electronic_ISBN
    978-0-7695-4800-5
  • Type
    conf
  • DOI
    10.1109/MCSoC.2012.30
  • Filename
    6354699
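
The abstract credits a block-major layout for the performance and stability gains but does not define the layout. The sketch below illustrates one common reading of the term: the matrix is cut into fixed-size blocks, and the elements of each block are stored contiguously. The function names, the row-major ordering inside and across blocks, and the block sizes are assumptions chosen for illustration; they are not taken from the paper, whose generated kernels are written in OpenCL.

```python
def to_block_major(a, rows, cols, br, bc):
    """Reorder a row-major flat list into a block-major flat list.

    Assumed convention (hypothetical, not from the paper): blocks are
    visited in row-major order, and the elements inside each br x bc
    block are also stored in row-major order.
    """
    if rows % br or cols % bc:
        raise ValueError("matrix dimensions must be multiples of the block size")
    out = []
    for bi in range(0, rows, br):            # first row of the current block
        for bj in range(0, cols, bc):        # first column of the current block
            for i in range(bi, bi + br):     # copy the block one row at a time
                start = i * cols + bj
                out.extend(a[start:start + bc])
    return out


def block_major_index(i, j, cols, br, bc):
    """Flat offset of element (i, j) under the same assumed convention."""
    blocks_per_row = cols // bc
    block_id = (i // br) * blocks_per_row + (j // bc)
    return block_id * br * bc + (i % br) * bc + (j % bc)


# A 4 x 4 matrix holding 0..15 in row-major order, split into 2 x 2 blocks.
matrix = list(range(16))
blocked = to_block_major(matrix, rows=4, cols=4, br=2, bc=2)
print(blocked)
```

With this ordering, a work-group that loads one block reads a single contiguous run of memory instead of several strided rows, which is the kind of access pattern that tends to hide global-memory latency.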