• DocumentCode
    3321727
  • Title
    An adaptive checkpointing protocol to bound recovery time with message logging
  • Author
    Ssu, Kuo-Feng ; Yao, Bin ; Fuchs, W. Kent
  • Author_Institution
    Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    244
  • Lastpage
    252
  • Abstract
    Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures. These solutions are often not applicable due to the lack of accurate data on the probability distribution of failures. Most current checkpoint libraries require application users to define a fixed time interval for checkpointing. The checkpoint interval usually implies the approximate maximum recovery time for single-process applications. However, actual recovery time can be much smaller when message logging is used. Due to this faster recovery, checkpointing may be more frequent than needed and thus unnecessary execution overhead is introduced. In this paper, an adaptive checkpointing protocol is developed to accurately enforce the user-defined recovery time and to reduce excessive checkpoints. An adaptive protocol has been implemented and evaluated using a receiver-based message logging algorithm on wired and wireless mobile networks. The results show that the protocol precisely maintains the user-defined maximum recovery times for several traces with varying message exchange rates. The mechanism incurs low overhead, avoids unnecessary checkpointing, and reduces failure-free execution time.
  • Keywords
    fault tolerant computing; system recovery; adaptive checkpointing; failure free execution; message logging; optimal checkpoint interval; recovery time; Application software; Checkpointing; Contracts; Exchange rates; Failure analysis; Libraries; Mathematical model; Probability distribution; Random processes; Wireless application protocol;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE Symposium on
  • Conference_Location
    Lausanne
  • ISSN
    1060-9857
  • Print_ISBN
    0-7695-0290-3
  • Type
    conf
  • DOI
    10.1109/RELDIS.1999.805100
  • Filename
    805100