• DocumentCode
    656196
  • Title
    A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems
  • Author
    Kandalla, Krishna ; Subramoni, Hari ; Tomko, Karen ; Pekurovsky, Dmitry ; Panda, Dhabaleswar K.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2013
  • fDate
    1-4 Oct. 2013
  • Firstpage
    611
  • Lastpage
    620
  • Abstract
    Non-blocking collectives have been recently standardized by the Message Passing Interface (MPI) Forum. However, intelligent designs offered by the MPI communication runtimes are likely to be the key factors that drive their adoption. While hardware-based solutions for non-blocking collective operations have shown promise, they require specialized hardware support and currently have several performance and scalability limitations. Alternatively, researchers have proposed software-based Functional Partitioning solutions for non-blocking collectives that rely on spare cores in each node to progress non-blocking collectives. However, these designs also require additional memory resources and involve expensive copy operations. Such factors limit the overall performance and scalability benefits associated with using non-blocking collectives in MPI. In this paper, we propose a high-performance, shared-memory-backed, user-level approach based on functional partitioning to design MPI-3 non-blocking collectives. Our approach relies on using one "Communication Servlet (CS)" thread per node to seamlessly execute the non-blocking collective operations on behalf of the application processes. Our design also eliminates the need for additional memory resources and expensive copy operations between the application processes and the CS. We demonstrate that our solution can deliver near-perfect computation/communication overlap with large-message, dense collective operations, such as MPI_Ialltoallv, while using just one core per node. We also study the benefits of our approach with a popular parallel 3D-FFT kernel, which has been re-designed to use the MPI_Ialltoallv operation. We observe that our proposed designs can improve the performance of the P3DFFT kernel by up to 27%, with 2,048 processes on the TACC Stampede system.
  • Keywords
    application program interfaces; message passing; multiprocessing systems; parallel processing; CS thread; MPI communication runtimes; MPI forum; TACC Stampede system; communication servlet thread; copy operations; fast Fourier transforms; functional partitioning approach; high-performance MPI-3 non-blocking alltoallv collective; memory resources; message passing interface; multi-core systems; parallel 3D-FFT kernel; software-based functional partitioning solutions; specialized hardware support; Instruction sets; Kernel; Libraries; Memory management; Message systems; Peer-to-peer computing; Scalability; Computation/Communication Overlap; InfiniBand; MPI-3 Non-Blocking Collectives; Multi-Cores; Parallel 3D FFT
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing (ICPP), 2013 42nd International Conference on
  • Conference_Location
    Lyon
  • ISSN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2013.75
  • Filename
    6687399