DocumentCode :
2959849
Title :
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications
Author :
Guermouche, Amina ; Ropars, Thomas ; Snir, Marc ; Cappello, Franck
Author_Institution :
INRIA Saclay-Ile de France, Orsay, France
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
1216
Lastpage :
1227
Abstract :
High-performance computing will probably reach exascale in this decade. At this scale, the mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message-passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for the inter-cluster messages confines the consequences of a failure to a single cluster while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message-passing applications that provides failure containment without logging any events, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; protocols; HydEE; MPICH2 library; application messages; concurrent failures; failure checkpointing protocols; failure containment; failure mean time; global restart; high-performance computing; hybrid fault tolerant protocols; hybrid rollback-recovery protocol; inter-cluster messages; large-scale send-deterministic MPI applications; memory occupation; message logging; message passing applications; Checkpointing; Fault tolerance; Fault tolerant systems; Libraries; Message passing; Protocols; High performance computing; MPI; failure containment; fault tolerance; send-determinism;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-4673-0975-2
Type :
conf
DOI :
10.1109/IPDPS.2012.111
Filename :
6267924