A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems

Author

Kandalla, Krishna ; Subramoni, Hari ; Tomko, Karen ; Pekurovsky, Dmitry ; Panda, Dhabaleswar K.

Author_Institution

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear

2013

fDate

1-4 Oct. 2013

Firstpage

611

Lastpage

620

Abstract

Non-blocking collectives have recently been standardized by the Message Passing Interface (MPI) Forum. However, intelligent designs offered by the MPI communication runtimes are likely to be the key factors that drive their adoption. While hardware-based solutions for non-blocking collective operations have shown promise, they require specialized hardware support and currently have several performance and scalability limitations. Alternatively, researchers have proposed software-based Functional Partitioning solutions that rely on spare cores in each node to progress non-blocking collectives. However, these designs also require additional memory resources and involve expensive copy operations. Such factors limit the overall performance and scalability benefits associated with using non-blocking collectives in MPI. In this paper, we propose a high-performance, shared-memory-backed, user-level approach based on functional partitioning to design MPI-3 non-blocking collectives. Our approach relies on using one "Communication Servlet (CS)" thread per node to seamlessly execute the non-blocking collective operations on behalf of the application processes. Our design also eliminates the need for additional memory resources and expensive copy operations between the application processes and the CS. We demonstrate that our solution can deliver near-perfect computation/communication overlap with large-message, dense collective operations, such as MPI_Ialltoallv, while using just one core per node. We also study the benefits of our approach with a popular parallel 3D-FFT kernel, which has been re-designed to use the MPI_Ialltoallv operation. We observe that our proposed designs can improve the performance of the P3DFFT kernel by up to 27%, with 2,048 processes on the TACC Stampede system.
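
The functional-partitioning idea in the abstract can be illustrated with a toy, single-process sketch. This is not the authors' implementation and it does not use MPI: the class names (`CommunicationServlet`, `Request`), the `ialltoallv` method, and the list-based "ranks" are all hypothetical, chosen only to show the structure. One helper thread takes exchange requests from a queue, works directly on the callers' buffers (passed by reference, so no staging copy is made), and signals completion through a request handle, while the caller continues computing. Because of Python's global interpreter lock, the sketch shows the control flow only and not real parallel overlap.

```python
import queue
import threading


class Request:
    """Completion handle, loosely analogous to an MPI_Request (hypothetical)."""

    def __init__(self):
        self._done = threading.Event()

    def _complete(self):
        self._done.set()

    def test(self):
        return self._done.is_set()

    def wait(self):
        self._done.wait()


class CommunicationServlet:
    """One helper thread that progresses exchanges for all simulated ranks."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def ialltoallv(self, send, sendcounts, sdispls, recv, rdispls):
        """Enqueue an exchange and return at once; buffers are not copied."""
        req = Request()
        self._work.put((send, sendcounts, sdispls, recv, rdispls, req))
        return req

    def shutdown(self):
        self._work.put(None)  # sentinel: stop the servlet loop
        self._thread.join()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:
                break
            send, sendcounts, sdispls, recv, rdispls, req = item
            nranks = len(send)
            for src in range(nranks):
                for dst in range(nranks):
                    n = sendcounts[src][dst]   # items src sends to dst
                    s = sdispls[src][dst]      # offset in src's send buffer
                    r = rdispls[dst][src]      # offset in dst's recv buffer
                    recv[dst][r:r + n] = send[src][s:s + n]
            req._complete()


def demo():
    # Two simulated ranks with uneven per-destination counts.
    send = [[10, 11, 12], [20, 21, 22]]
    sendcounts = [[1, 2], [2, 1]]
    sdispls = [[0, 1], [0, 2]]
    recv = [[None] * 3, [None] * 3]
    rdispls = [[0, 1], [0, 2]]

    servlet = CommunicationServlet()
    req = servlet.ialltoallv(send, sendcounts, sdispls, recv, rdispls)
    local = sum(i * i for i in range(1000))  # application work in the meantime
    req.wait()                               # exchange is complete after this
    servlet.shutdown()
    return recv, local
```

Calling `demo()` posts one exchange, performs unrelated work, and then waits on the handle, which mirrors the post / compute / wait pattern used with non-blocking collectives such as MPI_Ialltoallv.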

Keywords

application program interfaces; message passing; multiprocessing systems; parallel processing; CS thread; MPI communication runtimes; MPI forum; TACC Stampede system; communication servlet thread; copy operations; fast Fourier transforms; functional partitioning approach; high-performance MPI-3 non-blocking alltoallv collective; memory resources; message passing interface; multi-core systems; parallel 3D-FFT kernel; software-based functional partitioning solutions; specialized hardware support; Instruction sets; Kernel; Libraries; Memory management; Message systems; Peer-to-peer computing; Scalability; Computation/Communication Overlap; InfiniBand; MPI-3 Non-Blocking Collectives; Multi-Cores; Parallel 3D FFT;

fLanguage

English

Publisher

IEEE

Conference_Title

Parallel Processing (ICPP), 2013 42nd International Conference on

Conference_Location

Lyon

ISSN

0190-3918

Type

conf

DOI

10.1109/ICPP.2013.75

Filename

6687399

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=656196