• DocumentCode
    234515
  • Title
    A Hierarchical Tridiagonal System Solver for Heterogenous Supercomputers
  • Author
    Xinliang Wang; Yangtong Xu; Wei Xue
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2014
  • fDate
    17 Nov. 2014
  • Firstpage
    69
  • Lastpage
    76
  • Abstract
    The tridiagonal system solver is an important kernel in many scientific and engineering applications. Even though quite a few parallel algorithms and implementations have been proposed in recent years, challenges remain when solving large-scale tridiagonal systems on heterogeneous supercomputers. In this paper, a hierarchical algorithm framework, SPIKE² (pronounced "SPIKE squared"), is proposed to minimize the parallel overhead and to achieve the best utilization of CPU-GPU hybrid systems. For these systems, a layered and adaptive partitioning based on the SPIKE algorithm is presented to effectively control the sequential parts while efficiently exploiting the overlap of computation and communication within a heterogeneous computing node. Moreover, the SPIKE algorithm is reformulated to reduce the matrix computations to only 1/3 in our hierarchical algorithm framework. In addition, an improved implementation of the tiled-PCR-pThomas algorithm is employed for the GPU architecture, and the shared memory usage on the GPU is reduced by 1/3 through careful dependence analysis when solving unit-vector tridiagonal systems. Our experiments on Tianhe-1A show ideal weak scalability on up to 128 nodes when solving a tridiagonal system of size 1920M in the largest run, and good strong scalability (70%) from 32 nodes to 256 nodes when solving a tridiagonal system of size 480M. Furthermore, the adaptive task partition across the CPU and GPU yields a performance improvement of over 10% in the strong scaling test with 256 nodes. On one computing node of Tianhe-1A, our GPU-only code outperforms the cuSPARSE version (non-pivoting tridiagonal solver) by 30%, and our hybrid code is about 6.7 times faster than the Intel SPIKE multi-process version for tridiagonal systems of size 3M, 5M, and 15M.
  • Keywords
    parallel algorithms; parallel machines; CPU-GPU hybrid systems; SPIKE algorithm; Tianhe-1A; heterogeneous computing node; heterogenous supercomputers; hierarchical algorithm framework; hierarchical tridiagonal system solver; tiled-PCR-pThomas algorithm; unit vector tridiagonal systems; Clustering algorithms; Equations; Graphics processing units; Mathematical model; Matrix decomposition; Partitioning algorithms; Vectors; Tridiagonal system; Heterogeneous supercomputer; GPU
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2014 5th Workshop on
  • Conference_Location
    New Orleans, LA
  • Type
    conf
  • DOI
    10.1109/ScalA.2014.12
  • Filename
    7016736
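
  • Note
    The abstract refers to the Thomas algorithm as one building block of the tiled-PCR-pThomas solver. As background only, the following is a minimal sequential sketch of the classic non-pivoting Thomas algorithm for a single tridiagonal system. It is not the authors' SPIKE² or GPU implementation; the function name `thomas_solve` and its argument layout are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs without pivoting.

    Illustrative layout (an assumption, not taken from the paper):
      lower[i] is the sub-diagonal entry of row i   (lower[0] is unused)
      diag[i]  is the main-diagonal entry of row i
      upper[i] is the super-diagonal entry of row i (upper[-1] is unused)
    Without pivoting, the method is only safe for well-conditioned
    systems, for example diagonally dominant ones.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upward.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 4x4 system with 2 on the diagonal and -1 on both off-diagonals.
    solution = thomas_solve(
        [0.0, -1.0, -1.0, -1.0],
        [2.0, 2.0, 2.0, 2.0],
        [-1.0, -1.0, -1.0, 0.0],
        [0.0, 0.0, 0.0, 5.0],
    )
    print(solution)
```

    Both sweeps are inherently sequential, since each row depends on the previous one. That dependence is why parallel solvers such as those named in the abstract partition the system or combine the Thomas algorithm with a parallel reduction step.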