An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU Implementation

Author

Kasagi, Akihiko ; Nakano, Kaoru ; Ito, Yu

Author_Institution

Dept. of Inf. Eng., Hiroshima Univ., Higashi-Hiroshima, Japan

fYear

2013

fDate

1-4 Oct. 2013

Firstpage

1

Lastpage

10

Abstract

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. Offline permutation is the task of copying the numbers stored in an array a of size n to an array b of the same size according to a permutation P given in advance. A conventional algorithm completes the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. This conventional algorithm performs just three rounds of memory access: reading from a, reading from p, and writing to b. The main contribution of this paper is an optimal offline permutation algorithm that runs in O(n/w + L) time units using n threads on the HMM with width w and latency L. We also implement our optimal offline permutation algorithm on a GeForce GTX-680 GPU and evaluate its performance. Quite surprisingly, our optimal offline permutation algorithm achieves better performance than the conventional algorithm for most permutations, although it performs 32 rounds of memory access. For example, the bit-reversal permutation of 4M float (32-bit) numbers can be completed in 780ms by our optimal permutation algorithm, while the conventional algorithm takes 2328ms. These experimental results provide a good example of GPU computation in which a complicated but ingenious implementation with a larger constant factor in computing time outperforms a much simpler conventional algorithm.
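
The conventional algorithm described in the abstract can be illustrated with a minimal sequential sketch. This is not the paper's optimal algorithm and not GPU code: the function names `bit_reversal_permutation` and `conventional_permutation` are hypothetical, and the loop over i stands in for the n parallel threads, each of which would execute b[p[i]] ← a[i] once.

```python
def bit_reversal_permutation(k):
    """Return a list p of length 2**k where p[i] is i with its k bits reversed.

    Hypothetical helper for illustration; the paper's record does not
    specify how the permutation array p is built.
    """
    n = 1 << k
    p = []
    for i in range(n):
        r = 0
        for bit in range(k):
            if i & (1 << bit):
                # Bit at position `bit` moves to position k - 1 - bit.
                r |= 1 << (k - 1 - bit)
        p.append(r)
    return p


def conventional_permutation(a, p):
    """Sequential stand-in for the parallel step b[p[i]] <- a[i].

    Each iteration reads a[i], reads p[i], and writes b[p[i]],
    matching the three rounds of memory access named in the abstract.
    """
    b = [None] * len(a)
    for i in range(len(a)):  # on a GPU, one thread per index i (assumption)
        b[p[i]] = a[i]
    return b


if __name__ == "__main__":
    p = bit_reversal_permutation(3)
    a = [10, 11, 12, 13, 14, 15, 16, 17]
    print(p)
    print(conventional_permutation(a, p))
```

In this sketch the reads from a and p are sequential in i, while the writes to b are scattered by p; that scattered write pattern is presumably what the conventional algorithm pays for on real hardware, though the record itself gives only the timing figures.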

Keywords

graphics processing units; parallel architectures; CUDA-enabled GPU; GPU implementation; GeForce GTX-680 GPU; HMM; bit-reversal permutation; compute unified device architecture; constant factor; conventional algorithm; graphics processing unit; hierarchical memory machine; memory access; optimal offline permutation algorithm; theoretical parallel computing model; Arrays; Graphics processing units; Hidden Markov models; Instruction sets; Memory management; Pipelines; CUDA; GPU; Memory machine models; offline permutation;

fLanguage

English

Publisher

IEEE

Conference_Title

Parallel Processing (ICPP), 2013 42nd International Conference on

Conference_Location

Lyon

ISSN

0190-3918

Type

conf

DOI

10.1109/ICPP.2013.9

Filename

6687333