DocumentCode :
3633089
Title :
Characterization of consistent global checkpoints in large-scale distributed systems
Author :
R. Baldoni;J. Brzezinski;J.M. Helary;A. Mostefaoui;M. Raynal
Author_Institution :
IRISA, Rennes, France
fYear :
1995
Firstpage :
314
Lastpage :
323
Abstract :
Backward error recovery is one of the most widely used schemes for ensuring fault tolerance in distributed systems. Upon the occurrence of a failure, it restores a distributed computation to an error-free global state from which it can be resumed to produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery. In large-scale distributed systems, a coordinated approach to taking checkpoints is not practicable, while with an uncoordinated approach the probability of a domino effect during recovery may no longer be negligible. In this paper, we present a framework that allows us first to define the domino effect formally and second to state and prove a theorem that determines whether an arbitrary set of checkpoints is consistent. This theorem is very general, as it considers a semantics that includes missing and orphan messages. It plays a key role in designing uncoordinated checkpointing algorithms that take as few additional checkpoints as possible in order to ensure domino-free recovery.
Keywords :
"Large-scale systems","Computer errors","Forward contracts","Checkpointing","Error correction codes","Distributed computing","Fault tolerance","Error correction","Algorithm design and analysis"
Publisher :
IEEE
Conference_Title :
Distributed Computing Systems, 1995., Proceedings of the Fifth IEEE Computer Society Workshop on Future Trends of
Print_ISBN :
0-8186-7125-4
Type :
conf
DOI :
10.1109/FTDCS.1995.525000
Filename :
525000