• DocumentCode
    3204855
  • Title

    Communication-Avoiding QR Decomposition for GPUs

  • Author

    Anderson, Michael ; Ballard, Grey ; Demmel, James ; Keutzer, Kurt

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., UC Berkeley, Berkeley, CA, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    48
  • Lastpage
    58
  • Abstract
    We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs by up to 17x for tall-skinny matrices and Intel´s Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel´s Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU.
  • Keywords
    computer graphic equipment; coprocessors; principal component analysis; singular value decomposition; video signal processing; CULA parallel linear algebra library; GPU; Intel math kernel library; QR handle panel factorization; bandwidth-bound operations; communication-avoiding QR decomposition; communication-avoiding QR factorization; compute-bound kernels; data transfer overheads; general-purpose processor; graphics processing unit; singular value decomposition; statistical approach; tall-skinny matrix; video background subtraction; Graphics processing unit; Instruction sets; Kernel; Libraries; Matrix decomposition; Vegetation;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.15
  • Filename
    6012824