DocumentCode :
3144119
Title :
Optimizing MPI Collectives Using Efficient Intra-node Communication Techniques over the Blue Gene/P Supercomputer
Author :
Mamidala, Amith R. ; Faraj, Daniel ; Kumar, Sameer ; Miller, Douglas ; Blocksome, Michael ; Gooding, Thomas ; Heidelberger, Philip ; Dozsa, Gabor
Author_Institution :
T.J. Watson Res. Center, IBM, Yorktown Heights, NY, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
771
Lastpage :
780
Abstract :
The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Of these, a 3D torus equipped with a direct memory access (DMA) engine is the primary network. BG/P also features a collective network that supports hardware-accelerated collective operations such as broadcast and allreduce. One of the operating modes on BG/P is virtual node mode, in which all four cores of a node run as active MPI tasks, performing both inter-node and intra-node communication. This paper proposes software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode by using the cache-coherent memory subsystem as the communication channel within the node. The paper describes techniques that leverage atomic operations to design concurrent data structures, such as broadcast FIFOs, that enable efficient collectives. Such mechanisms are important as core counts are expected to rise, and these data structures make programming both easier and more efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access a peer's memory through specialized system calls. Apart from cutting down copy costs, such techniques allow seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using lightweight synchronization structures and atomic operations. Further, we demonstrate that shared address space techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. Compared to current approaches on the 3D torus, our optimizations improve MPI_Bcast performance by up to almost 3-fold and MPI_Allreduce performance by 33% in virtual node mode. We also see improvements of up to 44% for MPI_Bcast using the collective tree network.
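To illustrate the broadcast-FIFO idea described in the abstract, the sketch below shows a single-producer, multi-consumer FIFO synchronized purely with atomic operations, as it might look in a node-shared memory region. This is only a reading aid, not the paper's implementation: the names (bcast_fifo_t, NSLOTS, SLOT_BYTES, NCONSUMERS) are hypothetical, and C11 atomics stand in for the BG/P-specific atomic primitives.

/*
 * Minimal sketch of an intra-node broadcast FIFO built from atomic
 * operations.  Assumptions (not from the paper): the bcast_fifo_t
 * structure lives in a memory region shared by all MPI tasks on the
 * node (e.g., the four cores in virtual node mode), and C11 atomics
 * replace BG/P-specific atomic instructions.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS     8            /* slots in the broadcast ring            */
#define SLOT_BYTES 4096         /* payload copied per slot                */
#define NCONSUMERS 3            /* non-root cores on the node (4 - 1)     */

typedef struct {
    _Atomic unsigned long seq;  /* sequence number published in this slot */
    _Atomic int           done; /* consumers that have drained the slot   */
    char                  data[SLOT_BYTES];
} bcast_slot_t;

typedef struct {
    bcast_slot_t          slot[NSLOTS];
    _Atomic unsigned long head; /* next sequence number to publish        */
} bcast_fifo_t;

/* Call once before use; the region must already be shared across tasks.  */
static void fifo_init(bcast_fifo_t *f)
{
    for (int i = 0; i < NSLOTS; i++) {
        atomic_store(&f->slot[i].seq, 0ul);
        atomic_store(&f->slot[i].done, NCONSUMERS);  /* "already drained" */
    }
    atomic_store(&f->head, 0ul);
}

/* Root core: publish one chunk of the broadcast to all peers on the node. */
static void fifo_broadcast(bcast_fifo_t *f, const void *buf, size_t len)
{
    unsigned long s    = atomic_load_explicit(&f->head, memory_order_relaxed);
    bcast_slot_t *slot = &f->slot[s % NSLOTS];

    /* Reuse the slot only after every consumer drained its previous use.  */
    while (atomic_load_explicit(&slot->done, memory_order_acquire) != NCONSUMERS)
        ;                                            /* spin-wait          */

    memcpy(slot->data, buf, len < SLOT_BYTES ? len : SLOT_BYTES);
    atomic_store_explicit(&slot->done, 0, memory_order_relaxed);
    /* Release-publish: a consumer observing seq == s+1 also sees the data. */
    atomic_store_explicit(&slot->seq, s + 1, memory_order_release);
    atomic_store_explicit(&f->head, s + 1, memory_order_relaxed);
}

/* Non-root core: wait for chunk number `want` (1, 2, ...) and copy it out. */
static void fifo_receive(bcast_fifo_t *f, unsigned long want,
                         void *buf, size_t len)
{
    bcast_slot_t *slot = &f->slot[(want - 1) % NSLOTS];

    while (atomic_load_explicit(&slot->seq, memory_order_acquire) < want)
        ;                                            /* spin-wait          */

    memcpy(buf, slot->data, len < SLOT_BYTES ? len : SLOT_BYTES);
    /* Atomic increment lets the root detect when the slot can be reused.  */
    atomic_fetch_add_explicit(&slot->done, 1, memory_order_release);
}

The design choice sketched here is that consumers never contend on a lock: the root publishes with a release store of the sequence number, and each consumer acknowledges with an atomic increment, which is the kind of lightweight synchronization the abstract attributes to the proposed broadcast FIFOs.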
Keywords :
cache storage; message passing; parallel machines; resource allocation; tree data structures; 3D torus; Blue Gene/P supercomputer; MPI collective communication primitives; MPI collective optimization; MPI_Allreduce; MPI_Bcast; atomic operations; broadcast-FIFO; cache coherent memory subsystem; collective tree network; direct memory access engine; intranode communication techniques; load balancing; Algorithm design and analysis; Engines; Hardware; Kernel; Peer to peer computing; Radiation detectors; Three dimensional displays;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.220
Filename :
6008919