• DocumentCode
    3145035
  • Title

    Preserving Collective Performance across Process Failure for a Fault Tolerant MPI

  • Author

    Hursey, Joshua ; Graham, Richard L.

  • Author_Institution
    Oak Ridge Nat. Lab., Oak Ridge, TN, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1208
  • Lastpage
    1215
  • Abstract
    Application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what traditional techniques alone can provide. Applications will depend on libraries to sustain failure-free performance across process failure to continue to efficiently use High Performance Computing (HPC) systems even in the presence of process failure. Optimized Message Passing Interface (MPI) collective operations are a critical component of many scalable HPC applications. However, most of the collective algorithms are not able to handle process failure. Next generation MPI implementations must provide fault aware versions of such algorithms that can sustain performance across process failure. This paper discusses the design and implementation of fault aware collective algorithms for tree structured communication patterns. The three design approaches of rerouting, lookup avoiding and rebalancing are described, and analyzed for their performance impact relative to a similar fault unaware collective algorithm. The analysis shows that the rerouting approach causes up to a four times performance degradation while the rebalancing approach can bring the performance within 1% of the fault unaware performance. Additionally, this paper introduces the reader to a set of run-through stabilization semantics being developed by the MPI Forum's Fault Tolerance Working Group to support ABFT. This paper underscores the need for care to be taken when designing new fault aware collective algorithms for fault tolerant MPI implementations.
  • Keywords
    message passing; resource allocation; safety-critical software; software fault tolerance; software performance evaluation; table lookup; tree data structures; ABFT techniques; algorithm based fault tolerance; application developers; application recovery; collective operations; collective performance; failure-free performance; fault aware algorithms; fault aware collective algorithms; fault tolerant MPI; high performance computing systems; lookup avoiding; performance degradation; process failure; rebalancing; rerouting; tree structured communication patterns; Algorithm design and analysis; Fault tolerance; Fault tolerant systems; Optimization; Proposals; Prototypes; Semantics;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.274
  • Filename
    6008971