DocumentCode :
3144119
Title :
Optimizing MPI Collectives Using Efficient Intra-node Communication Techniques over the Blue Gene/P Supercomputer
Author :
Mamidala, Amith R. ; Faraj, Daniel ; Kumar, Sameer ; Miller, Douglas ; Blocksome, Michael ; Gooding, Thomas ; Heidelberger, Philip ; Dozsa, Gabor
Author_Institution :
T.J. Watson Res. Center, IBM, Yorktown Heights, NY, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
771
Lastpage :
780
Abstract :
The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Of these, a 3D torus equipped with a direct memory access (DMA) engine is the primary network. BG/P also features a collective network that supports hardware-accelerated collective operations such as broadcast and allreduce. One of the operating modes on BG/P is virtual node mode, in which all four cores of a node run as active MPI tasks, performing both inter-node and intra-node communication. This paper proposes software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode by using the cache-coherent memory subsystem as the communication channel within the node. The paper describes techniques that leverage atomic operations to design concurrent data structures, such as broadcast FIFOs, that enable efficient collectives. Such mechanisms are important as core counts are expected to rise, and these data structures make programming both easier and more efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access a peer's memory through specialized system calls. Apart from cutting down copy costs, such techniques allow seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using lightweight synchronization structures and atomic operations. Further, we demonstrate that shared address space techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. Compared to current approaches on the 3D torus, our optimizations improve MPI_Bcast performance by up to almost 3-fold and MPI_Allreduce performance by 33% in virtual node mode. We also see improvements of up to 44% for MPI_Bcast using the collective tree network.
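To illustrate the broadcast-FIFO idea described in the abstract, the sketch below shows a single-producer, multi-consumer FIFO synchronized purely with atomic operations, as it might look in a node-shared memory region. This is only a reading aid, not the paper's implementation: the names (bcast_fifo_t, NSLOTS, SLOT_BYTES, NCONSUMERS) are hypothetical, and C11 atomics stand in for the BG/P-specific atomic primitives.

/*
 * Minimal sketch of an intra-node broadcast FIFO built from atomic
 * operations.  Assumptions (not from the paper): the bcast_fifo_t
 * structure lives in a memory region shared by all MPI tasks on the
 * node (e.g., the four cores in virtual node mode), and C11 atomics
 * replace BG/P-specific atomic instructions.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS     8            /* slots in the broadcast ring            */
#define SLOT_BYTES 4096         /* payload copied per slot                */
#define NCONSUMERS 3            /* non-root cores on the node (4 - 1)     */

typedef struct {
    _Atomic unsigned long seq;  /* sequence number published in this slot */
    _Atomic int           done; /* consumers that have drained the slot   */
    char                  data[SLOT_BYTES];
} bcast_slot_t;

typedef struct {
    bcast_slot_t          slot[NSLOTS];
    _Atomic unsigned long head; /* next sequence number to publish        */
} bcast_fifo_t;

/* Call once before use; the region must already be shared across tasks.  */
static void fifo_init(bcast_fifo_t *f)
{
    for (int i = 0; i < NSLOTS; i++) {
        atomic_store(&f->slot[i].seq, 0ul);
        atomic_store(&f->slot[i].done, NCONSUMERS);  /* "already drained" */
    }
    atomic_store(&f->head, 0ul);
}

/* Root core: publish one chunk of the broadcast to all peers on the node. */
static void fifo_broadcast(bcast_fifo_t *f, const void *buf, size_t len)
{
    unsigned long s    = atomic_load_explicit(&f->head, memory_order_relaxed);
    bcast_slot_t *slot = &f->slot[s % NSLOTS];

    /* Reuse the slot only after every consumer drained its previous use.  */
    while (atomic_load_explicit(&slot->done, memory_order_acquire) != NCONSUMERS)
        ;                                            /* spin-wait          */

    memcpy(slot->data, buf, len < SLOT_BYTES ? len : SLOT_BYTES);
    atomic_store_explicit(&slot->done, 0, memory_order_relaxed);
    /* Release-publish: a consumer observing seq == s+1 also sees the data. */
    atomic_store_explicit(&slot->seq, s + 1, memory_order_release);
    atomic_store_explicit(&f->head, s + 1, memory_order_relaxed);
}

/* Non-root core: wait for chunk number `want` (1, 2, ...) and copy it out. */
static void fifo_receive(bcast_fifo_t *f, unsigned long want,
                         void *buf, size_t len)
{
    bcast_slot_t *slot = &f->slot[(want - 1) % NSLOTS];

    while (atomic_load_explicit(&slot->seq, memory_order_acquire) < want)
        ;                                            /* spin-wait          */

    memcpy(buf, slot->data, len < SLOT_BYTES ? len : SLOT_BYTES);
    /* Atomic increment lets the root detect when the slot can be reused.  */
    atomic_fetch_add_explicit(&slot->done, 1, memory_order_release);
}

The design choice sketched here is that consumers never contend on a lock: the root publishes with a release store of the sequence number, and each consumer acknowledges with an atomic increment, which is the kind of lightweight synchronization the abstract attributes to the proposed broadcast FIFOs.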
Keywords :
cache storage; message passing; parallel machines; resource allocation; tree data structures; 3D torus; Blue Gene/P supercomputer; MPI collective communication primitives; MPI collective optimization; MPI_Allreduce; MPI_Bcast; atomic operations; broadcast-FIFO; cache coherent memory subsystem; collective tree network; direct memory access engine; intranode communication techniques; load balancing; Algorithm design and analysis; Engines; Hardware; Kernel; Peer to peer computing; Radiation detectors; Three dimensional displays;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.220
Filename :
6008919