DocumentCode :
3024728
Title :
Cloning-based checkpoint for localized recovery
Author :
Wei, Zunce ; Li, Hon F. ; Goswami, Dhrubajyoti
Author_Institution :
Dept. of Comput. Sci., Concordia Univ., Montreal, Que., Canada
fYear :
2005
fDate :
7-9 Dec. 2005
Abstract :
This paper studies the use of process clones to localize recovery in large-scale distributed systems. A clone is a virtual recovery process with a limited lifetime, and it is useful for decoupling recovery dependencies among checkpoints. A generic checkpoint dependency graph (CDG) model is used to capture the dependency relations among checkpoints. A Non-atomic Group Checkpoint (NGC) protocol is presented, and it is proved that, when clones are employed, the protocol can result in localized recovery involving a single group. To limit the spread of recovery, the size of a group should be bounded. The paper presents several results in this respect: (i) there is no embedded protocol for atomic group formation with a bounded group size (a k-bounded protocol); (ii) in a system consisting of m processes, a k-bounded atomic group checkpoint protocol requires at least m − 1 explicit messages for checkpoint synchronization. Finally, a simple k-bounded atomic group checkpoint protocol is presented and proved.
Keywords :
checkpointing; graph colouring; message passing; protocols; CDG model; NGC protocol; checkpoint dependency graph; cloning-based checkpoint; k-bounded atomic group checkpoint protocol; large-scale distributed systems; localized recovery; Cloning; Computer crashes; Computer science; Costs; Delay; Fault tolerant systems; Large-scale systems; Protocols; Runtime; System performance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8th International Symposium on
ISSN :
1087-4089
Print_ISBN :
0-7695-2509-1
Type :
conf
DOI :
10.1109/ISPAN.2005.26
Filename :
1575823