Title :
Quantifying rollback propagation in distributed checkpointing
Author :
Agbaria, Adnan ; Attiya, Hagit ; Friedman, Roy ; Vitenberg, Roman
Author_Institution :
Dept. of Comput. Sci., Technion-Israel Inst. of Technol., Haifa, Israel
Abstract :
Proposes a new classification of executions with checkpoints that is based on the notion of k-rollback, indicating the maximal number of checkpoints that may need to be rolled back during recovery. The relation between known execution classes is explored, and it is shown that coordinated checkpointing, SZPF (strictly Z-path free) and ZPF (Z-path free) are 1-rollback mechanisms, while ZCF (Z-cycle free) is (n-1)-rollback, where n is the number of participants in an execution. A new class of executions, called d-BC (d-bounded cycles), is introduced, and is shown to be an [(n-1)·d]-rollback mechanism (ZCF is a special case of d-BC for d=1). Finally, a d-BC protocol is presented. This protocol has the nice property that it does not impose any control information overhead on an application's messages, yet it only sends a few control messages of its own. Moreover, the protocol maintains information about recovery lines, which enables very efficient discovery of the most recent recovery line that existed a short time before the failure.
Keywords :
distributed algorithms; fault tolerant computing; protocols; system recovery; SZPF; Z-cycle free class; Z-path free class; ZCF; ZPF; application messages; control information overhead; control messages; coordinated checkpointing; d-BC protocol; d-bounded cycles; distributed checkpointing; execution classes; execution classification; k-rollback; recovery line information; rollback propagation; system failure; Application software; Checkpointing; Computer science; Distributed computing; Electronic mail; Fault tolerant systems; Information retrieval; Protocols; Software debugging
Conference_Title :
Reliable Distributed Systems, 2001. Proceedings. 20th IEEE Symposium on
Conference_Location :
New Orleans, LA
Print_ISBN :
0-7695-1366-2
DOI :
10.1109/RELDIS.2001.969737