Title :
Padding free bank conflict resolution for CUDA-based matrix transpose algorithm
Author :
Khan, Ajmal ; Al-Mouhamed, Mayez ; Fatayar, A. ; Almousa, A. ; Baqais, A. ; Assayony, M.
Author_Institution :
Dept. of Comput. Eng., King Fahd Univ. of Pet. & Miner., Dhahran, Saudi Arabia
Date :
June 30 2014-July 2 2014
Abstract :
Matrix transposition is an important linear algebra procedure with a deep impact on various computational science and engineering applications. Several factors hinder the expected performance of large matrix transposes on Graphics Processing Units (GPUs). The performance degradation stems from the memory access pattern, namely uncoalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed that address these issues by ensuring coalesced global memory access and conflict-free bank access. The proposed algorithms have execution times comparable to the NVIDIA SDK bank-conflict-free matrix transpose implementation. Their main advantage is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T × T) of the problem space. By contrast, to the best of our knowledge, previously published approaches need to allocate a padded tile of T × (T + 1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% improvement in performance.
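The trade-off between a padded T × (T + 1) tile and an exact T × T tile can be illustrated with a small host-side model of shared-memory banking. This is only a sketch under stated assumptions: 32 banks of 4-byte words, T = 32, and one warp reading one tile column at a time. The diagonal (skewed) index used for the padding-free case is a generic, well-known remapping chosen for illustration; the abstract does not describe the two proposed algorithms, so this should not be read as the authors' method. All function names are hypothetical.

```python
T = 32       # tile width; assumed equal to the warp size
BANKS = 32   # assumed number of 4-byte shared-memory banks


def naive_index(row, col):
    """Plain T x T row-major tile: word offset of element (row, col)."""
    return row * T + col


def padded_index(row, col):
    """Padded T x (T + 1) tile: one unused word per row shifts each row by one bank."""
    return row * (T + 1) + col


def skewed_index(row, col):
    """Illustrative padding-free layout: rotate each row by its row number (assumption)."""
    return row * T + (col + row) % T


def max_conflict_degree(index_of):
    """Worst case, over all tile columns, of threads in one warp hitting the same bank.

    Thread `row` of the warp reads element (row, col), which is the column-wise
    access a tiled transpose performs on shared memory.
    """
    worst = 0
    for col in range(T):
        hits = [0] * BANKS
        for row in range(T):
            hits[index_of(row, col) % BANKS] += 1
        worst = max(worst, max(hits))
    return worst


if __name__ == "__main__":
    layouts = (
        ("naive  T x T", naive_index, T * T),
        ("padded T x (T+1)", padded_index, T * (T + 1)),
        ("skewed T x T", skewed_index, T * T),
    )
    for name, index_of, words in layouts:
        degree = max_conflict_degree(index_of)
        print(f"{name}: {words} words, worst conflict degree {degree}")
```

Under these assumptions the model reports a 32-way conflict for the naive layout and a conflict degree of 1 for both the padded and the skewed layouts, while the skewed layout keeps the footprint at T × T = 1024 words instead of T × (T + 1) = 1056.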
Keywords :
graphics processing units; mathematics computing; matrix algebra; parallel architectures; shared memory systems; storage allocation; CUDA-based matrix transpose algorithm; GPU; NVIDIA SDK bank conflict-free matrix transpose; coalesced access; computational engineering application; computational science application; conflict free bank access; graphic processing units; linear algebra procedure; matrix transposition; memory access pattern; padding free bank conflict resolution; recursive Gaussian implementation; shared memory allocation; shared streaming multiprocessor memory; Algorithm design and analysis; Graphics processing units; Indexes; Instruction sets; Kernel; Linear algebra; Writing; Bank conflict free; CUDA GPU; coalesced memory access; linear Algebra solvers; matrix transpose; solving system of linear equations;
Conference_Title :
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on
Conference_Location :
Las Vegas, NV
DOI :
10.1109/SNPD.2014.6888709