Title :
A High Performance and Memory Efficient LU Decomposer on FPGAs
Author :
Wu, Guiming ; Dou, Yong ; Sun, Junqing ; Peterson, Gregory D.
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
fDate :
3/1/2012 12:00:00 AM
Abstract :
LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform LU decomposition of large matrices on FPGAs with limited local memory, we propose a block LU decomposition algorithm for FPGAs that is applicable to arbitrary matrix sizes. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, to sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to a multi-FPGA platform by using a block-cyclic data distribution and an inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared with previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
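To make the loop-blocking idea in the abstract concrete, the following is a minimal software sketch of a textbook right-looking blocked LU decomposition. It is not the authors' hardware algorithm: it omits space-time mapping and the PE array, it assumes no pivoting is needed (the abstract does not say how pivoting is handled), and the names `block_lu`, `lu_product`, and the block-size parameter `b` are hypothetical. The block size need not divide the matrix order, which mirrors the "arbitrary matrix size" claim only in spirit.

```python
def block_lu(a, b):
    """Blocked LU (Doolittle, no pivoting) on a square list-of-lists matrix, in place.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    `b` is the block size (an illustrative parameter, not from the paper).
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)  # the last block may be narrower than b

        # Step 1: factor the current column panel a[k0:n][k0:k1] unblocked.
        for k in range(k0, k1):
            pivot = a[k][k]
            if pivot == 0:
                raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
            for i in range(k + 1, n):
                a[i][k] /= pivot
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]

        # Step 2: forward-substitute to get the U block row a[k0:k1][k1:n].
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]

        # Step 3: update the trailing submatrix: A22 -= L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a


def lu_product(lu):
    """Rebuild A = L * U from the packed factors returned by block_lu."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out


if __name__ == "__main__":
    # A 5x5 strictly diagonally dominant matrix, so no pivoting is required.
    original = [[30.0 if i == j else float(i + j + 1) for j in range(5)]
                for i in range(5)]
    packed = block_lu([row[:] for row in original], 2)  # 2 does not divide 5
    rebuilt = lu_product(packed)
    err = max(abs(rebuilt[i][j] - original[i][j])
              for i in range(5) for j in range(5))
    print("max reconstruction error:", err)
```

In this sketch, steps 2 and 3 touch only one block row and the trailing submatrix per outer iteration, which is the property that lets a blocked formulation work from limited local memory; how the paper actually schedules these steps onto its linear PE array is not described in the abstract.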
Keywords :
field programmable gate arrays; matrix decomposition; 70.66 GFLOPS; 8.50 GFLOPS; Xilinx Virtex-5 XC5VLX330; Xilinx Virtex-6 XC6VLX760; arbitrary matrix size; block LU decomposition algorithm; block-cyclic data distribution; dense matrices; engineering applications; frequency 133 MHz; frequency 200 MHz; general-purpose processors; hardware resource constraints; inter-FPGA communication scheme; linear algebra kernel; local memory; loop blocking; matrix lower-upper decomposition; memory efficient hardware architecture; processing elements; scientific applications; self-designed PCI-express card; sequential nonblocking LU decomposition; space-time mapping; Arrays; Field programmable gate arrays; Hardware; Matrix decomposition; Program processors; FPGA; LU decomposition; linear array
Journal_Title :
Computers, IEEE Transactions on
DOI :
10.1109/TC.2010.278