DocumentCode :
2791790
Title :
Implementing and Evaluating Automatic Checkpointing
Author :
Martins, Antonio S., Jr. ; Gonçalves, Ronaldo A L
Author_Institution :
Data Process. Center, State Univ. of Maringa, Colombo
fYear :
2007
fDate :
26-30 March 2007
Firstpage :
1
Lastpage :
8
Abstract :
As computer clusters continue to grow in size and popularity, fault tolerance is becoming crucial to ensuring high performance and reliability for applications. To provide it, a checkpoint mechanism is used to recover a failed parallel application by rolling it back to a point in its execution prior to the failure. In this work we present a mechanism for automatically managing checkpoint operations when failures occur. This mechanism periodically records the application's context, identifies failed nodes, and restarts MPI processes on the remaining nodes, allowing the application to continue and taking advantage of the computation already completed. We describe the many changes made to the LAM/MPI source code. Experiments with an application for recognizing DNA similarity showed that, despite the overhead caused by periodic checkpoints, the benefits can reach about 50% on a small cluster.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel programming; MPI processes; automatic checkpointing; computer clusters; fault tolerance; message passing interface; Application software; Checkpointing; DNA; Fault tolerance; File systems; Image analysis; Image sequence analysis; Operating systems; Parallel processing; Pattern analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Conference_Location :
Long Beach, CA
Print_ISBN :
1-4244-0910-1
Electronic_ISBN :
1-4244-0910-1
Type :
conf
DOI :
10.1109/IPDPS.2007.370557
Filename :
4228285