• DocumentCode
    62432
  • Title
    Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator
  • Author
    Pedram, Ardavan ; Gerstlauer, Andreas ; Van De Geijn, Robert A.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Texas at Austin, Austin, TX, USA
  • Volume
    63
  • Issue
    8
  • fYear
    2014
  • fDate
    Aug. 1 2014
  • Firstpage
    1854
  • Lastpage
    1867
  • Abstract
    This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations and their blocked algorithms. As part of the study, we expose the benefits of redesigning floating-point units and their surrounding datapaths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design tradeoffs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study of inner kernels is extended to the blocked level and shows that, at the block level, the Linear Algebra Core (LAC) can achieve high efficiencies with up to 45 GFLOPS/W for both Cholesky and LU factorization, and over 35 GFLOPS/W for QR factorization. While maintaining such efficiencies, our extensions to the MAC units can achieve up to 10, 12, and 20 percent speedup for the blocked algorithms of Cholesky, LU, and QR factorization, respectively.
  • Keywords
    application specific integrated circuits; computational complexity; computer architecture; floating point arithmetic; integrated circuit design; least mean squares methods; linear algebra; linear systems; matrix decomposition; Cholesky factorization; LU factorization; QR factorizations; algorithm mapping; blocked algorithms; dense linear systems; floating-point unit codesign; full-custom ASIC designs; linear algebra core; linear algebra processor; linear least-square problems; matrix factorization accelerator; partial pivoting; Algorithm design and analysis; Complexity theory; Kernel; Registers; Low-power design; energy-aware systems; performance analysis and design aids; special-purpose hardware
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type
    jour
  • DOI
    10.1109/TC.2014.2315627
  • Filename
    6782713