A High Performance and Memory Efficient LU Decomposer on FPGAs

Author

Wu, Guiming ; Dou, Yong ; Sun, Junqing ; Peterson, Gregory D.

Author_Institution

Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China

Volume

61

Issue

3

fYear

2012

fDate

3/1/2012

Firstpage

366

Lastpage

378

Abstract

LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm on FPGAs applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, onto sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to the multi-FPGA platform by using a block-cyclic data distribution and inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to the previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
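The loop-blocking idea mentioned in the abstract can be illustrated with a minimal software sketch. It is not the authors' hardware algorithm: it assumes no pivoting (so the input should be, for example, diagonally dominant), it runs sequentially rather than on a linear array of PEs, and the names `block_lu`, `reconstruct`, and the block-size parameter `b` are hypothetical. It only shows how a blocked factorization handles a matrix size that is not a multiple of the block size.

```python
def block_lu(a, b):
    """Right-looking block LU without pivoting (illustrative assumption).

    Returns a new matrix holding L (unit lower, below the diagonal)
    and U (on and above the diagonal) packed together.
    """
    n = len(a)
    a = [row[:] for row in a]          # work on a copy
    for k in range(0, n, b):
        ke = min(k + b, n)             # last block may be smaller than b
        # 1. Factor the diagonal block with unblocked LU.
        for j in range(k, ke):
            for i in range(j + 1, ke):
                a[i][j] /= a[j][j]
                for c in range(j + 1, ke):
                    a[i][c] -= a[i][j] * a[j][c]
        # 2. Row panel: solve L11 * U12 = A12 by forward substitution.
        for c in range(ke, n):
            for j in range(k, ke):
                for i in range(j + 1, ke):
                    a[i][c] -= a[i][j] * a[j][c]
        # 3. Column panel: solve L21 * U11 = A21.
        for i in range(ke, n):
            for j in range(k, ke):
                a[i][j] /= a[j][j]
                for c in range(j + 1, ke):
                    a[i][c] -= a[i][j] * a[j][c]
        # 4. Trailing submatrix update: A22 -= L21 * U12.
        for i in range(ke, n):
            for c in range(ke, n):
                s = 0.0
                for j in range(k, ke):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a


def reconstruct(lu):
    """Multiply the packed L and U factors back together."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]   # unit diagonal of L
                s += l_ip * lu[p][j]
            out[i][j] = s
    return out


def make_test_matrix(n):
    """Diagonally dominant matrix, so skipping pivoting is safe here."""
    return [[1.0 / (i + j + 1) + (n if i == j else 0.0) for j in range(n)]
            for i in range(n)]
```

Calling `block_lu(make_test_matrix(5), 2)` factors a 5 x 5 matrix with 2 x 2 blocks, and `reconstruct` applied to the result should reproduce the input up to rounding. In a hardware setting, the block size would presumably be chosen to fit on-chip memory, but that mapping is outside this sketch.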

Keywords

field programmable gate arrays; matrix decomposition; 70.66 GFLOPS; 8.50 GFLOPS; Xilinx Virtex-5 XC5VLX330; Xilinx Virtex-6 XC6VLX760; arbitrary matrix size; block LU decomposition algorithm; block-cyclic data distribution; dense matrices; engineering applications; frequency 133 MHz; frequency 200 MHz; general-purpose processors; hardware resource constraints; inter-FPGA communication scheme; linear algebra kernel; local memory; loop blocking; matrix lower-upper decomposition; memory efficient hardware architecture; processing elements; scientific applications; self-designed PCI-express card; sequential nonblocking LU decomposition; space-time mapping; Arrays; Field programmable gate arrays; Hardware; Matrix decomposition; Program processors; FPGA; LU decomposition; linear array; loop blocking; space-time mapping

fLanguage

English

Journal_Title

Computers, IEEE Transactions on

Publisher

IEEE

ISSN

0018-9340

Type

jour

DOI

10.1109/TC.2010.278

Filename

5674024