• DocumentCode
    1411096
  • Title
    A High Performance and Memory Efficient LU Decomposer on FPGAs
  • Author
    Wu, Guiming; Dou, Yong; Sun, Junqing; Peterson, Gregory D.
  • Author_Institution
    Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
  • Volume
    61
  • Issue
    3
  • fYear
    2012
  • fDate
    3/1/2012
  • Firstpage
    366
  • Lastpage
    378
  • Abstract
    LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm on FPGAs applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, onto sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to the multi-FPGA platform by using a block-cyclic data distribution and inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to the previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
  • Keywords
    field programmable gate arrays; matrix decomposition; 70.66 GFLOPS; 8.50 GFLOPS; Xilinx Virtex-5 XC5VLX330; Xilinx Virtex-6 XC6VLX760; arbitrary matrix size; block LU decomposition algorithm; block-cyclic data distribution; dense matrices; engineering applications; frequency 133 MHz; frequency 200 MHz; general-purpose processors; hardware resource constraints; inter-FPGA communication scheme; linear algebra kernel; local memory; loop blocking; matrix lower-upper decomposition; memory efficient hardware architecture; processing elements; scientific applications; self-designed PCI-Express card; sequential nonblocking LU decomposition; space-time mapping; Arrays; Hardware; Program processors; FPGA; LU decomposition; linear array
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type
    jour
  • DOI
    10.1109/TC.2010.278
  • Filename
    5674024
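
The abstract describes applying loop blocking to a sequential LU decomposition so that large matrices can be processed with limited local memory. The following is only a minimal software sketch of that general idea, not the authors' hardware algorithm: it assumes LU without pivoting (the record does not say whether pivoting is used), it omits the space-time mapping onto processing elements entirely, and the names `lu_unblocked`, `block_lu`, `reconstruct`, and the block size `b` are hypothetical and chosen for illustration.

```python
def lu_unblocked(a, lo, hi):
    """In-place LU without pivoting (assumption) of the block a[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            a[i][k] /= a[k][k]
            for j in range(k + 1, hi):
                a[i][j] -= a[i][k] * a[k][j]


def block_lu(a, b):
    """In-place blocked LU of a square matrix (list of lists), block size b.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)  # handles n not divisible by b

        # 1. Factor the diagonal block: A11 = L11 * U11.
        lu_unblocked(a, k0, k1)

        # 2. Row panel: solve L11 * U12 = A12 by forward substitution.
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]

        # 3. Column panel: solve L21 * U11 = A21.
        for i in range(k1, n):
            for k in range(k0, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]

        # 4. Trailing update: A22 -= L21 * U12 (the dominant cost).
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += a[i][k] * a[k][j]
                a[i][j] -= s
    return a


def reconstruct(lu):
    """Multiply the packed L and U factors back together for checking."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out
```

As a usage check, one can factor a small diagonally dominant matrix with several block sizes and confirm that `reconstruct(block_lu(copy, b))` matches the original to within rounding error. Only one block row, one block column, and the current diagonal block are live at each step, which is the property that makes blocking attractive when local memory is small.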