• DocumentCode
    656133
  • Title
    An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU Implementation
  • Author
    Kasagi, Akihiko ; Nakano, Koji ; Ito, Yasuaki
  • Author_Institution
    Dept. of Inf. Eng., Hiroshima Univ., Higashi-Hiroshima, Japan
  • fYear
    2013
  • fDate
    1-4 Oct. 2013
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. Offline permutation is the task of copying the numbers stored in an array a of size n to an array b of the same size according to a permutation P given in advance. A conventional algorithm completes the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. This conventional algorithm performs just three rounds of memory access: reading from a, reading from p, and writing to b. The main contribution of this paper is an optimal offline permutation algorithm that runs in O(n/w + L) time units using n threads on the HMM with width w and latency L. We also implement the optimal algorithm on a GeForce GTX-680 GPU and evaluate its performance. Quite surprisingly, it outperforms the conventional algorithm for most permutations, although it performs 32 rounds of memory access. For example, our optimal algorithm completes the bit-reversal permutation of 4M float (32-bit) numbers in 780ms, while the conventional algorithm takes 2328ms. These experimental results provide a good example of GPU computation in which a complicated but ingenious implementation with a larger constant factor in computing time outperforms a much simpler conventional algorithm.
  • Keywords
    graphics processing units; parallel architectures; CUDA-enabled GPU; GPU implementation; GeForce GTX-680 GPU; HMM; bit-reversal permutation; compute unified device architecture; constant factor; conventional algorithm; graphics processing unit; hierarchical memory machine; memory access; optimal offline permutation algorithm; theoretical parallel computing model; Arrays; Graphics processing units; Hidden Markov models; Instruction sets; Memory management; Pipelines; CUDA; GPU; Memory machine models; offline permutation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel Processing (ICPP), 2013 42nd International Conference on
  • Conference_Location
    Lyon
  • ISSN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2013.9
  • Filename
    6687333
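
The conventional algorithm described in the abstract, b[p[i]] ← a[i] for all i, can be sketched sequentially as below, together with the bit-reversal permutation used in the reported measurement. This is a minimal illustration only, not the paper's optimal algorithm and not its GPU implementation. All function names are hypothetical. The helper groups_touched_per_warp is an assumption about how to reason about memory cost: it treats every w consecutive addresses as one address group and counts how many distinct groups the w threads of one warp write to, which gives a rough sense of why scattered writes are expensive on a machine of width w.

```python
def bit_reversal_permutation(n):
    """Return p where p[i] is i with its log2(n) bits reversed (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    p = []
    for i in range(n):
        r = 0
        for k in range(bits):
            if (i >> k) & 1:
                r |= 1 << (bits - 1 - k)
        p.append(r)
    return p


def permute_conventional(a, p):
    """Conventional offline permutation: b[p[i]] <- a[i] for every i.

    On a GPU each i would be handled by its own thread; here the loop
    is sequential, so only the result (not the timing) is modeled.
    """
    b = [None] * len(a)
    for i in range(len(a)):
        b[p[i]] = a[i]  # read a[i], read p[i], write b[p[i]]
    return b


def groups_touched_per_warp(p, w):
    """Assumed cost proxy: distinct write address groups per warp of w threads.

    Address x is taken to belong to group x // w (an assumption of this sketch).
    """
    return [
        len({p[i] // w for i in range(start, min(start + w, len(p)))})
        for start in range(0, len(p), w)
    ]


if __name__ == "__main__":
    p = bit_reversal_permutation(8)
    print(p)
    print(permute_conventional(list("abcdefgh"), p))
    # Compare the identity permutation with bit reversal for n = 1024, w = 32.
    print(max(groups_touched_per_warp(list(range(1024)), 32)))
    print(max(groups_touched_per_warp(bit_reversal_permutation(1024), 32)))
```

Under this assumed proxy, the identity permutation keeps each warp's writes inside a single address group, whereas bit reversal spreads them over as many groups as there are threads in the warp, which is consistent with the abstract's choice of bit reversal as a hard case for the conventional algorithm.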