• DocumentCode
    3242805
  • Title
    Improving Performance of Matrix Multiplication and FFT on GPU
  • Author
    Cui, Xiang ; Chen, Yifeng ; Mei, Hong

  • Author_Institution
    Key Lab. of High Confidence Software Technol., Peking Univ., Beijing, China
  • fYear
    2009
  • fDate
    8-11 Dec. 2009
  • Firstpage
    42
  • Lastpage
    48
  • Abstract
    In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. A peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280 for the former, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms.
  • Keywords
    computer graphics; coprocessors; fast Fourier transforms; matrix multiplication; CUDA; GPU; NVIDIA GeForce GTX280; communication-intensive; computation-intensive; computer speed 393 GFLOPS; memory bandwidth intensive; single-precision FFT; single-precision matrix-matrix multiplication subprogram; Bandwidth; Computer science education; Educational technology; Hardware; Laboratories; Libraries; Programming profession; Software performance; Testing; Yarn; FFT
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-5788-5
  • Type
    conf
  • DOI
    10.1109/ICPADS.2009.8
  • Filename
    5395212