• DocumentCode
    1459611
  • Title

    On coordinated checkpointing in distributed systems

  • Author

    Cao, Guohong ; Singhal, Mukesh

  • Author_Institution
    Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
  • Volume
    9
  • Issue
    12
  • fYear
    1998
  • fDate
    12/1/1998 12:00:00 AM
  • Firstpage
    1213
  • Lastpage
    1225
  • Abstract
    Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
  • Keywords
    distributed processing; synchronisation; system recovery; Prakash-Singhal algorithm; consistent global checkpoint; coordinated checkpointing; distributed computing systems; distributed systems; failure recovery; minimized checkpoints; minimized synchronization messages; nonblocking checkpointing process; overhead reduction; stable storage; Algorithm design and analysis; Checkpointing; Degradation; Distributed computing; Fault tolerant systems; Process control; Programming profession; System performance;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/71.737697
  • Filename
    737697