Title :
Simplifying the Recovery Model of User-Level Failure Mitigation
Author :
Bland, Wesley ; Raffenetti, Kenneth ; Balaji, Pavan
Author_Institution :
Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
Abstract :
As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more widespread, many potential users are concerned about its complexity and the need to rewrite existing codes. In this paper, we present a usage model that is similar to the usage already common in existing codes but that does not require the user to restart the application (thereby incurring the costs of re-entering the batch queue, startup costs, etc.). We use a new implementation of ULFM in MPICH, a popular open source MPI implementation, and demonstrate the ULFM usage using the Monte Carlo Communication Kernel, a proxy-app developed by the Center for Exascale Simulation of Advanced Reactors. Results show that the approach used incurs a level of intrusiveness into the code similar to that of existing checkpoint/restart models, but with less overhead.
Keywords :
Monte Carlo methods; application program interfaces; message passing; parallel processing; public domain software; system recovery; MPI forum; MPI standard; Monte Carlo Communication Kernel; ULFM; fault tolerance; high-performance computing; message passing interface; open source MPI implementation; recovery model; user level failure mitigation; Benchmark testing; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Resilience; Runtime
Conference_Title :
Exascale MPI at Supercomputing Conference (ExaMPI), 2014 Workshop on
Conference_Location :
New Orleans, LA
DOI :
10.1109/ExaMPI.2014.4