• DocumentCode
    236607
  • Title
    Simplifying the Recovery Model of User-Level Failure Mitigation
  • Author
    Bland, Wesley ; Raffenetti, Kenneth ; Balaji, Pavan
  • Author_Institution
    Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
  • fYear
    2014
  • fDate
    17 Nov. 2014
  • Firstpage
    20
  • Lastpage
    25
  • Abstract
    As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more widespread, many potential users are concerned about its complexity and the need to rewrite existing codes. In this paper, we present a usage model that is similar to the usage already common in existing codes but that does not require the user to restart the application (thereby incurring the costs of re-entering the batch queue, startup costs, etc.). We use a new implementation of ULFM in MPICH, a popular open source MPI implementation, and demonstrate the ULFM usage using the Monte Carlo Communication Kernel, a proxy-app developed by the Center for Exascale Simulation of Advanced Reactors. Results show that the approach used incurs a level of intrusiveness into the code similar to that of existing checkpoint/restart models, but with less overhead.
  • Keywords
    Monte Carlo methods; application program interfaces; message passing; parallel processing; public domain software; system recovery; MPI forum; MPI standard; Monte Carlo Communication Kernel; ULFM; fault tolerance; high-performance computing; message passing interface; open source MPI implementation; recovery model; user level failure mitigation; Benchmark testing; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Resilience; Runtime
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Exascale MPI at Supercomputing Conference (ExaMPI), 2014 Workshop on
  • Conference_Location
    New Orleans, LA
  • Type
    conf
  • DOI
    10.1109/ExaMPI.2014.4
  • Filename
    7018164