Title :
Auto-tuning Dense Matrix Multiplication for GPGPU with Cache
Author :
Cui, Xiang ; Chen, Yifeng ; Zhang, Changyou ; Mei, Hong
Author_Institution :
Key Laboratory of High Confidence Software Technologies, Peking University, Beijing, China
Abstract :
In this paper we discuss our experience improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new Fermi features such as the cache affect performance. We find that the addition of a cache to the GPU, on the one hand, helps the processors exploit data locality that arises at runtime, but on the other hand makes the dependence of performance on algorithmic parameters less predictable. Auto-tuning then becomes a useful technique for addressing this issue. Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050. The design and implementation use only CUDA and C and do not rely on tuning at the level of binary code.
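To make the auto-tuning idea concrete, the following is a minimal CUDA sketch (not the authors' code) of tuning one algorithmic parameter: a shared-memory tile size for a simple SGEMM kernel is a compile-time parameter, and the host times a few candidate sizes and reports them. The kernel name sgemm_tiled, the matrix size, and the candidate tile sizes are illustrative assumptions; the paper's actual kernels and parameter space are more elaborate.

// Sketch: sweep a tunable tile size for a shared-memory SGEMM (C = A * B, n x n, row-major).
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one tile of B in shared memory (guarded loads).
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < n)
                                         ? A[row * n + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < n && t + threadIdx.y < n)
                                         ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

template <int TILE>
float time_variant(const float* A, const float* B, float* C, int n) {
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg); cudaEventCreate(&end);
    cudaEventRecord(beg);
    sgemm_tiled<TILE><<<grid, block>>>(A, B, C, n);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    cudaEventDestroy(beg); cudaEventDestroy(end);
    return ms;
}

int main() {
    const int n = 1024;                    // illustrative problem size
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float)); // data left uninitialized: only timing matters here
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    float t8  = time_variant<8>(A, B, C, n);
    float t16 = time_variant<16>(A, B, C, n);
    float t32 = time_variant<32>(A, B, C, n);
    printf("TILE=8: %.3f ms  TILE=16: %.3f ms  TILE=32: %.3f ms\n", t8, t16, t32);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

In an auto-tuner of this kind, the fastest variant found by the sweep would be selected for production runs; with a cache present, as the abstract notes, the best parameter choice is hard to predict analytically, which is why an empirical search is used.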
Keywords :
cache storage; coprocessors; matrix multiplication; CUDA; SGEMM; DGEMM; Fermi architecture; GPGPU; GPU; Tesla C2050; auto-tuning; data locality; cache;
Conference_Titel :
2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS)
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-9727-0
ISSN :
1521-9097
DOI :
10.1109/ICPADS.2010.64