DocumentCode :
2540410
Title :
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
Author :
Lemarinier, Pierre ; Bouteiller, Aurelien ; Herault, Thomas ; Krawezik, Geraud ; Cappello, Franck
Author_Institution :
LRI, Univ. de Paris Sud, Orsay, France
fYear :
2004
fDate :
20-23 Sept. 2004
Firstpage :
115
Lastpage :
124
Abstract :
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems, with different impacts on application performance and on the capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing incurs a higher performance penalty than message logging because of higher stress on the checkpoint server. We extend this study to improved versions of the message logging and coordinated checkpointing protocols, which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpointing. We detail the protocols and their implementation in the new MPICH-V fault tolerant framework. We compare their performance against the previous versions, and we compare the novel message logging protocols against the improved coordinated checkpointing one using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of this work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; protocols; MPI library; MPICH-V fault tolerance; NAS benchmark; automatic fault detection; blocking control messages; checkpoint server; communication latency; coordinated checkpointing protocol; fault tolerant MPI; high performance cluster; high speed network; message logging protocols; message passing systems; performance penalty; pessimistic message logging; transparent fault detection; Checkpointing; Communication system control; Delay; Fault detection; Fault tolerance; High-speed networks; Libraries; Message passing; Protocols; Stress;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing, 2004 IEEE International Conference on
ISSN :
1552-5244
Print_ISBN :
0-7803-8694-9
Type :
conf
DOI :
10.1109/CLUSTR.2004.1392609
Filename :
1392609