• DocumentCode
    3633089
  • Title

    Characterization of consistent global checkpoints in large-scale distributed systems

  • Author

    R. Baldoni;J. Brzezinski;J.M. Helary;A. Mostefaoui;M. Raynal

  • Author_Institution
    IRISA, Rennes, France
  • fYear
    1995
  • Firstpage
    314
  • Lastpage
    323
  • Abstract
    Backward error recovery is one of the most widely used schemes for ensuring fault tolerance in distributed systems. Upon the occurrence of a failure, it restores a distributed computation to an error-free global state from which the computation can be resumed to produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery. In large-scale distributed systems, a coordinated approach to taking checkpoints is not practicable, while with an uncoordinated approach the probability of a domino effect during recovery may no longer be negligible. In this paper, we present a framework that allows us first to define the domino effect formally and second to state and prove a theorem that determines whether an arbitrary set of checkpoints is consistent. The theorem is very general, as it considers a semantics that includes missing and orphan messages. It plays a key role in designing uncoordinated checkpointing algorithms that take as few additional checkpoints as possible in order to ensure domino-free recovery.
  • Keywords
    "Large-scale systems","Computer errors","Forward contracts","Checkpointing","Error correction codes","Distributed computing","Fault tolerance","Error correction","Algorithm design and analysis"
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the Fifth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, 1995
  • Print_ISBN
    0-8186-7125-4
  • Type

    conf

  • DOI
    10.1109/FTDCS.1995.525000
  • Filename
    525000