Title :
Avoiding checkpoint contamination in parallel systems
Author :
Silva, L.M. ; Silva, J.G.
Author_Institution :
Dept. de Engenharia Inf., Coimbra Univ., Portugal
Abstract :
Checkpointing with rollback recovery is an effective technique for tolerating faults, provided the application can recover from a previous checkpoint and proceed with a failure-free computation. However, the technique may fall short if the checkpoint files are contaminated by errors. This paper presents two mechanisms for determining whether a committed checkpoint is error-free. These techniques can be used simultaneously for error detection and failure recovery. Both are based on checkpoint duplication: one uses spatial redundancy, while the other uses temporal redundancy. We discuss the main problems and trade-offs involved in implementing these techniques, and then present a performance study that shows the pros and cons of each. As far as we know, this paper presents the first implementation of these mechanisms in a standard parallel computing system.
Keywords :
fault tolerant computing; parallel processing; redundancy; system recovery; checkpoint contamination avoidance; checkpoint duplication; checkpoint files; checkpointing; error detection; failure recovery; failure-free computation; parallel computing system; parallel systems; rollback recovery; spatial redundancy; temporal redundancy; Checkpointing; Contamination; Costs; Ear; Error correction; Fault detection; Fault tolerant systems; Hardware; Mechanical factors; Parallel processing
Conference_Title :
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location :
Munich, Germany
Print_ISBN :
0-8186-8470-4
DOI :
10.1109/FTCS.1998.689487