• DocumentCode
    34732
  • Title
    LU Factorization with Partial Pivoting for a Multicore System with Accelerators
  • Author
    Kurzak, Jakub ; Luszczek, Piotr ; Faverge, Mathieu ; Dongarra, Jack
  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN, USA
  • Volume
    24
  • Issue
    8
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1613
  • Lastpage
    1621
  • Abstract
    LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disparity between the computational power of the CPUs and that of the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, which is imposed by the use of partial pivoting. The challenges are tackled with a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations, and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
  • Keywords
    graphics processing units; multiprocessing systems; parallel processing; scheduling; shared memory systems; AMD Magny Cours CPU; CPU core; GPU accelerator; LU factorization; NVIDIA Fermi GPU; block LU algorithm; canonical numerical procedure; data layout; dynamic task scheduling; fine-grain parallelization; graphics processing unit; high performance LINPACK benchmark; memory hierarchy; memory-bound CPU operation; multicore system; panel factorization component; partial pivoting; shared memory system; Dynamic scheduling; Kernel; Layout; Libraries; Plasmas; GPU; Gaussian elimination; Tiles; accelerator; manycore; multicore
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2012.242
  • Filename
    6280548
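
For readers unfamiliar with the terms used in the abstract, the following is a minimal sketch of a textbook right-looking block LU factorization with partial pivoting. It is not the paper's hybrid CPU/GPU implementation: it runs sequentially in plain Python, and the function name `lu_partial_pivot_blocked` and the panel width parameter `nb` are hypothetical names chosen for illustration. It only shows the three phases the abstract alludes to: the panel factorization (pivot search and row swaps), the triangular solve, and the trailing submatrix update.

```python
def lu_partial_pivot_blocked(a, nb=2):
    """Factor a square matrix in place so that P*A = L*U.

    a  : list of row lists, overwritten with L (strictly below the diagonal,
         unit diagonal implied) and U (on and above the diagonal)
    nb : panel width (hypothetical tuning parameter for this sketch)
    Returns perm, where row i of the factored matrix is original row perm[i].
    """
    n = len(a)
    perm = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)

        # 1) Panel factorization: unblocked LU on columns k0..k1-1.
        for k in range(k0, k1):
            # Partial pivoting: pick the largest entry in column k, rows k..n-1.
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if a[p][k] == 0.0:
                raise ValueError("matrix is singular")
            if p != k:
                # Swapping whole rows also applies the interchange to the
                # columns left and right of the panel.
                a[k], a[p] = a[p], a[k]
                perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]              # multiplier, stored as L[i][k]
                for j in range(k + 1, k1):      # update only inside the panel
                    a[i][j] -= a[i][k] * a[k][j]

        # 2) Triangular solve: rows k0..k1-1 of U to the right of the panel.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]

        # 3) Trailing update: A22 -= L21 * U12 (a matrix multiply).
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return perm
```

In this sketch, phase 1 is the memory-bound, synchronization-heavy part because every column requires a pivot search over all remaining rows, while phase 3 is a matrix multiplication and dominates the arithmetic for large matrices.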