DocumentCode
3242805
Title
Improving Performance of Matrix Multiplication and FFT on GPU
Author
Cui, Xiang ; Chen, Yifeng ; Mei, Hong
Author_Institution
Key Lab. of High Confidence Software Technol., Peking Univ., Beijing, China
fYear
2009
fDate
8-11 Dec. 2009
Firstpage
42
Lastpage
48
Abstract
In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth-intensive or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
Keywords
computer graphics; coprocessors; fast Fourier transforms; matrix multiplication; CUDA; GPU; NVIDIA GeForce GTX280; communication-intensive; computation-intensive; computer speed 393 GFLOPS; memory bandwidth intensive; single-precision FFT; single-precision matrix-matrix multiplication subprogram; Bandwidth; Computer science education; Educational technology; Hardware; Laboratories; Libraries; Programming profession; Software performance; Testing; Yarn; FFT
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Conference_Location
Shenzhen
ISSN
1521-9097
Print_ISBN
978-1-4244-5788-5
Type
conf
DOI
10.1109/ICPADS.2009.8
Filename
5395212