• DocumentCode
    2178537
  • Title

    Optimization and Implementation of LBM Benchmark on Multithreaded GPU

  • Author

    Ren, Xiaoguang ; Tang, Yuhua ; Wang, Guibin ; Tang, Tao ; Fang, Xudong

  • Author_Institution
    Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2010
  • fDate
    9-10 Feb. 2010
  • Firstpage
    116
  • Lastpage
    122
  • Abstract
    With the rapid development of transistor technology, the Graphics Processing Unit (GPU) is increasingly used in non-graphics applications, and major GPU hardware vendors have introduced software stacks for their own GPUs, such as Brook+ for AMD GPUs. Compared with traditional parallel systems, heterogeneous systems integrating stream-based multi-threaded GPUs provide higher parallel computing capability at lower cost. However, porting traditional applications to heterogeneous systems places new demands on application optimization for the GPU. Based on AMD's Brook+ platform, we explored application optimization features on an AMD GPU by optimizing and implementing the LBM benchmark from SPEC2006. To improve program locality, we optimized the original data layout of LBM. Using the short vector data types provided by Brook+, we also optimized the GPU's bandwidth utilization and the efficiency of its thread processors. Through branch elimination, we reduced the performance loss caused by branch divergence in the kernel, which results from the GPU's SIMD execution mode. The experimental results show that data layout, memory bandwidth, branch paths, and other factors closely affect the performance of program execution on the GPU. With all the optimizations applied, we obtained a speedup of 22x (single-precision) and 19x (double-precision) over the original serial benchmark code on a quad-core CPU, and a speedup of 4x (single-precision) and 8.7x (double-precision) over the original OMP benchmark code on an 8-core CPU.
  • Keywords
    coprocessors; multi-threading; parallel processing; AMD GPU; AMD's Brook+ platform; LBM benchmark; application optimization; branch elimination; graphic processing unit; heterogeneous systems; nongraphics applications; parallel computing; parallel systems; program locality; software stacks; stream-based multithreaded GPU; transistor technology; Application software; Bandwidth; Biomedical computing; Computational fluid dynamics; Computational modeling; Computer architecture; Engines; Hardware; Parallel processing; Yarn; Brook+; GPU; LBM; optimization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Storage and Data Engineering (DSDE), 2010 International Conference on
  • Conference_Location
    Bangalore
  • Print_ISBN
    978-1-4244-5678-9
  • Type

    conf

  • DOI
    10.1109/DSDE.2010.45
  • Filename
    5452598