Title :
Checkpointing Message-Passing Interface (MPI) parallel programs
Author :
Li, Wei-Jih ; Tsay, Jyh-Jong
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Chung Cheng Univ., Chiayi, Taiwan
Abstract :
Many scientific problems can be distributed across a large number of processes to take advantage of low-cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated, uncoordinated, and message logging techniques.
Keywords :
message passing; parallel programming; program testing; software portability; Message Passing Interface; checkpointing; message-passing; parallel programs; parallel systems; Availability; Checkpointing; Communications technology; Computer science; Concurrent computing; Costs; Degradation; Message passing; Workstations; Writing;
Conference_Title :
Fault-Tolerant Systems, 1997. Proceedings., Pacific Rim International Symposium on
Conference_Location :
Taipei
Print_ISBN :
0-8186-8212-4
DOI :
10.1109/PRFTS.1997.640140