Title :
Hardware implementation of MPI_Barrier on an FPGA cluster
Author :
Gao, Shanyuan ; Schmidt, Andrew G. ; Sass, Ron
Author_Institution :
Electr. & Comput. Eng. Dept., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
Date :
Aug. 31 2009-Sept. 2 2009
Abstract :
Message passing is the dominant programming model for distributed-memory parallel computers, and the Message-Passing Interface (MPI) is the standard. Along with point-to-point send and receive message primitives, MPI includes a set of collective communication operations that are used to synchronize and coordinate groups of tasks. MPI_Barrier, one of the most important collective procedures, has been extensively studied on a variety of architectures over the last twenty years. However, a cluster of Platform FPGAs is a new architecture and offers interesting, resource-efficient options for implementing the barrier operation. This paper describes an FPGA implementation of MPI_Barrier. The premise is that the barrier (like other collective communication operations) is very sensitive to latency as the number of nodes scales to the tens of thousands, so the relatively slow processors found on FPGAs will significantly cap the performance of a software implementation. The FPGA hardware design implements a tree-based algorithm and is tightly integrated with the custom high-speed on-chip/off-chip network. MPI access is available through a specially designed kernel module. This effectively offloads the work from the CPU and OS into hardware. The evaluation of this design shows significant performance gains compared with a conventional software implementation on both an FPGA cluster and a commodity cluster. Further, it suggests that moving other MPI collective operations into hardware would be beneficial.
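The abstract names a tree-based algorithm but does not give its details. As an illustration only, the following minimal Python sketch models a generic tree barrier and counts the sequential network hops it needs. The heap-style node numbering, the `fanout` parameter, the function name `tree_barrier_hops`, and the one-hop-per-link cost model are all assumptions made for this sketch; they are not taken from the paper and do not describe the authors' hardware design.

```python
def tree_barrier_hops(num_nodes: int, fanout: int = 2) -> int:
    """Count sequential hops for a generic tree barrier (illustrative model).

    Assumptions (not from the paper): nodes are numbered 0..num_nodes-1 in
    heap order, so the parent of node i is (i - 1) // fanout, node 0 is the
    root, and every link traversal costs exactly one hop.
    """
    if num_nodes < 1 or fanout < 2:
        raise ValueError("need num_nodes >= 1 and fanout >= 2")

    # Arrival phase: a node notifies its parent only after all of its
    # children have notified it. Children have larger indices than their
    # parent, so a reverse scan sees every child before its parent.
    arrive = [0] * num_nodes
    for node in range(num_nodes - 1, 0, -1):
        parent = (node - 1) // fanout
        arrive[parent] = max(arrive[parent], arrive[node] + 1)

    # Release phase: once the root has heard from the whole tree, the
    # release signal travels back down, one hop per level.
    release = [0] * num_nodes
    release[0] = arrive[0]
    for node in range(1, num_nodes):
        release[node] = release[(node - 1) // fanout] + 1

    # The barrier completes when the last node has been released.
    return max(release)


if __name__ == "__main__":
    for n in (2, 8, 10000):
        print(n, tree_barrier_hops(n))
```

Under these assumptions the hop count grows with the depth of the tree rather than with the number of nodes, which is why per-hop latency dominates the cost of a barrier at large scale.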
Keywords :
application program interfaces; field programmable gate arrays; message passing; microprocessor chips; FPGA cluster; MPI_barrier operation; collective communication operation; distributed memory parallel computer; kernel module; message passing interface; on-chip/off-chip network; tree-based algorithm; Algorithm design and analysis; Computer interfaces; Concurrent computing; Delay; Distributed computing; Field programmable gate arrays; Hardware; Kernel; Network-on-a-chip; Parallel programming;
Conference_Title :
Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on
Conference_Location :
Prague
Print_ISBN :
978-1-4244-3892-1
Electronic_ISBN :
1946-1488
DOI :
10.1109/FPL.2009.5272560