• DocumentCode
    2626709
  • Title
    Adaptive independent checkpointing for reducing rollback propagation
  • Author
    Xu, Jian ; Netzer, Robert H B
  • Author_Institution
    Dept. of Comput. Sci., Brown Univ., Providence, RI, USA
  • fYear
    1993
  • fDate
    1-4 Dec 1993
  • Firstpage
    754
  • Lastpage
    761
  • Abstract
    Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. However, it can suffer from the domino effect, which causes the rollback of one process to potentially propagate to others. In this paper we present an adaptive checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing. Our algorithm is based on proofs of the conditions necessary and sufficient for a checkpoint to belong to some consistent global checkpoint, previously an open question. We characterize these conditions with a generalization of Lamport's happened-before relation called a zigzag path. Our algorithm tracks zigzag paths on-line and checkpoints when certain paths are detected. Experiments on an iPSC/860 hypercube show that our algorithm reduces the average rollback required to recover from any fault to less than one checkpoint interval per process, and checkpoints only 4% more often than traditional periodic checkpointing algorithms. We thus eliminate rollback propagation without the runtime overhead of coordinated checkpoints or other schemes that attempt to reduce rollback propagation.
  • Keywords
    computer network reliability; fault tolerant computing; parallel algorithms; reliability; system recovery; adaptive checkpointing algorithm; coordinated checkpoints; distributed systems; domino effect; fault tolerance; happened-before relation; iPSC/860 hypercube; independent checkpointing; rollback propagation; runtime overhead; zigzag path; Checkpointing; Computer science; Fault detection; Fault tolerance; Fault tolerant systems; Forward contracts; Hypercubes; Runtime; Sufficient conditions; Vents;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing, 1993. Proceedings of the Fifth IEEE Symposium on
  • Conference_Location
    Dallas, TX
  • Print_ISBN
    0-8186-4222-X
  • Type
    conf
  • DOI
    10.1109/SPDP.1993.395456
  • Filename
    395456