Improving Performance of Matrix Multiplication and FFT on GPU

Author

Cui, Xiang ; Chen, Yifeng ; Mei, Hong

Author_Institution

Key Lab. of High Confidence Software Technol., Peking Univ., Beijing, China

fYear

2009

fDate

8-11 Dec. 2009

Firstpage

42

Lastpage

48

Abstract

In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth-intensive or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms.

Keywords

computer graphics; coprocessors; fast Fourier transforms; matrix multiplication; CUDA; GPU; NVIDIA GeForce GTX280; communication-intensive; computation-intensive; computer speed 393 GFLOPS; memory bandwidth intensive; single-precision FFT; single-precision matrix-matrix multiplication subprogram; Bandwidth; Computer science education; Educational technology; Hardware; Laboratories; Libraries; Programming profession; Software performance; Testing; Yarn; FFT

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on

Conference_Location

Shenzhen

ISSN

1521-9097

Print_ISBN

978-1-4244-5788-5

Type

conf

DOI

10.1109/ICPADS.2009.8

Filename

5395212