DocumentCode :
3242805
Title :
Improving Performance of Matrix Multiplication and FFT on GPU
Author :
Cui, Xiang ; Chen, Yifeng ; Mei, Hong
Author_Institution :
Key Lab. of High Confidence Software Technol., Peking Univ., Beijing, China
fYear :
2009
fDate :
8-11 Dec. 2009
Firstpage :
42
Lastpage :
48
Abstract :
In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. A peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280 for the former, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms.
Keywords :
computer graphics; coprocessors; fast Fourier transforms; matrix multiplication; CUDA; GPU; NVIDIA GeForce GTX280; communication-intensive; computation-intensive; computer speed 393 GFLOPS; memory bandwidth intensive; single-precision FFT; single-precision matrix-matrix multiplication subprogram; Bandwidth; Computer science education; Educational technology; Hardware; Laboratories; Libraries; Programming profession; Software performance; Testing; Yarn; FFT
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Conference_Location :
Shenzhen
ISSN :
1521-9097
Print_ISBN :
978-1-4244-5788-5
Type :
conf
DOI :
10.1109/ICPADS.2009.8
Filename :
5395212