DocumentCode
3175127
Title
Avoiding checkpoint contamination in parallel systems
Author
Silva, L.M.; Silva, J.G.
Author_Institution
Dept. de Engenharia Inf., Coimbra Univ., Portugal
fYear
1998
fDate
23-25 June 1998
Firstpage
364
Lastpage
369
Abstract
Checkpointing and rollback recovery is a very effective technique to tolerate faults, provided the application is able to recover from a previous checkpoint and proceed with a failure-free computation. However, this technique may fall short if the checkpoint files are somehow contaminated by errors. This paper presents two mechanisms that may be used to determine if a committed checkpoint is error-free or not. These techniques can be used simultaneously for error detection and failure recovery. Both of them are based on checkpoint duplication: one makes use of spatial redundancy while the other is based on temporal redundancy. We discuss the main problems and trade-offs that have to be dealt with to implement these techniques. We then present a performance study that clearly shows the pros and cons of each one. As far as we know, this paper presents the first implementation of these mechanisms in a standard parallel computing system.
Keywords
fault tolerant computing; parallel processing; redundancy; system recovery; checkpoint contamination avoidance; checkpoint duplication; checkpoint files; checkpointing; error detection; failure recovery; failure-free computation; parallel computing system; parallel systems; rollback recovery; spatial redundancy; temporal redundancy; Checkpointing; Contamination; Costs; Ear; Error correction; Fault detection; Fault tolerant systems; Hardware; Mechanical factors; Parallel processing
fLanguage
English
Publisher
IEEE
Conference_Titel
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location
Munich, Germany
ISSN
0731-3071
Print_ISBN
0-8186-8470-4
Type
conf
DOI
10.1109/FTCS.1998.689487
Filename
689487