• DocumentCode
    560185
  • Title

    Fast implementation of DGEMM on Fermi GPU

  • Author

    Tan, Guangming ; Li, Linchuan ; Triechle, Sean ; Phillips, Everett ; Bao, Yungang ; Sun, Ninghui

  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    In this paper we present a thorough experience on tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance modeling based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves comparable performance with the latest CUBLAS library. We further improve upon this with an implementation in the native machine language, leading to 20% increase in performance. That is, the achieved peak performance (efficiency) is improved from 302Gflop/s (58%) to 362Gflop/s (70%).
  • Keywords
    coprocessors; mathematics computing; matrix multiplication; parallel architectures; scheduling; shared memory systems; CUDA algorithm; Fermi GPU architecture; Fermi memory hierarchy; double-precision matrix-matrix multiplication; instruction scheduling; microarchitecture benchmarks; native machine language; performance modeling; registers; shared memory; software pipelining; vector memory operation; Bandwidth; Graphics processing unit; Instruction sets; Memory management; Optimization; Registers; CUDA; GPU; high performance computing; matrix-matrix multiplication;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type

    conf

  • Filename
    6114452