Title :
Auto-Tuning GEMV on Many-Core GPU
Author :
Weizhi Xu ; Zhiyong Liu ; Jun Wu ; Xiaochun Ye ; Shuai Jiao ; Da Wang ; Fenglong Song ; Dongrui Fan
Author_Institution :
State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
Abstract :
GPUs provide high computing power, especially for data-parallel algorithms. However, the complexity of the GPU system makes even a simple algorithm difficult to optimize. Different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 or MAGMA are not efficient, especially for small matrices or fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using only one thread, we use a warp to compute each element of the output vector y. We also propose a novel register blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 or MAGMA, and that the new register blocking method also outperforms CUBLAS 4.0 or MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.
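The warp-per-element idea in the abstract can be illustrated with a small CPU simulation. This is only a sketch under stated assumptions, not the paper's kernel: the abstract does not give the actual implementation, so the function names, the row-major list-of-lists layout, and the strided work split across lanes are all hypothetical. The only hardware fact relied on is that an NVIDIA warp has 32 threads. On a Fermi GPU the final cross-lane reduction would typically go through shared memory; here it is modelled as a plain tree reduction over a Python list.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs, including Fermi


def gemv_thread_per_row(A, x):
    """Baseline: one (simulated) thread computes each y[i] serially."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def gemv_warp_per_row(A, x):
    """Simulate 32 lanes of one warp cooperating on each y[i].

    Hypothetical illustration: the lane assignment and reduction order
    are assumptions, not taken from the paper.
    """
    y = []
    for row in A:
        # Each lane accumulates a strided slice of the dot product.
        partial = [0.0] * WARP_SIZE
        for lane in range(WARP_SIZE):
            for j in range(lane, len(row), WARP_SIZE):
                partial[lane] += row[j] * x[j]
        # Tree reduction across lanes with offsets 16, 8, 4, 2, 1.
        offset = WARP_SIZE // 2
        while offset > 0:
            for lane in range(offset):
                partial[lane] += partial[lane + offset]
            offset //= 2
        y.append(partial[0])  # lane 0 holds the finished element
    return y


if __name__ == "__main__":
    # A "fat" matrix: 3 rows, 100 columns.
    A = [[float(i + j) for j in range(100)] for i in range(3)]
    x = [1.0] * 100
    print(gemv_warp_per_row(A, x))
```

For a fat matrix, the thread-per-row scheme would keep only a handful of threads busy (three in this example), whereas the warp-per-row scheme gives each row 32 cooperating lanes, which is presumably why the abstract singles out small and fat matrices.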
Keywords :
graphics processing units; mathematics computing; matrix multiplication; multiprocessing systems; optimisation; parallel algorithms; software performance evaluation; vectors; Fermi GPU; arbitrary input matrix; autotuning GEMV; data parallel algorithms; engineering computations; fat matrices; general dense matrices; many-core GPU; matrix-vector multiplication routine; register blocking method; scientific computations; small square matrices; Algorithm design and analysis; Computer architecture; Graphics processing units; Instruction sets; Kernel; Registers; Vectors; GEMV; GPU; Performance Tuning;
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4673-4565-1
ISSN :
1521-9097
DOI :
10.1109/ICPADS.2012.15