DocumentCode :
2626709
Title :
Adaptive independent checkpointing for reducing rollback propagation
Author :
Xu, Jian ; Netzer, Robert H B
Author_Institution :
Dept. of Comput. Sci., Brown Univ., Providence, RI, USA
fYear :
1993
fDate :
1-4 Dec 1993
Firstpage :
754
Lastpage :
761
Abstract :
Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. However, it can suffer from the domino effect, which causes the rollback of one process to potentially propagate to others. In this paper we present an adaptive checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing. Our algorithm is based on proofs of the conditions necessary and sufficient for a checkpoint to belong to some consistent global checkpoint, previously an open question. We characterize these conditions with a generalization of Lamport's happened-before relation called a zigzag path. Our algorithm tracks zigzag paths on-line and checkpoints when certain paths are detected. Experiments on an iPSC/860 hypercube show that our algorithm reduces the average rollback required to recover from any fault to less than one checkpoint interval per process, and checkpoints only 4% more often than traditional periodic checkpointing algorithms. We thus eliminate rollback propagation without the runtime overhead of coordinated checkpoints or other schemes that attempt to reduce rollback propagation.
Keywords :
computer network reliability; fault tolerant computing; parallel algorithms; reliability; system recovery; adaptive checkpointing algorithm; coordinated checkpoints; distributed systems; domino effect; fault tolerance; happened-before relation; iPSC/860 hypercube; independent checkpointing; rollback propagation; runtime overhead; zigzag path; Checkpointing; Computer science; Fault detection; Fault tolerance; Fault tolerant systems; Forward contracts; Hypercubes; Runtime; Sufficient conditions; Vents;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing, 1993. Proceedings of the Fifth IEEE Symposium on
Conference_Location :
Dallas, TX
Print_ISBN :
0-8186-4222-X
Type :
conf
DOI :
10.1109/SPDP.1993.395456
Filename :
395456