• DocumentCode
    3175127
  • Title
    Avoiding checkpoint contamination in parallel systems
  • Author
    Silva, L.M. ; Silva, J.G.
  • Author_Institution
    Dept. de Engenharia Inf., Coimbra Univ., Portugal
  • fYear
    1998
  • fDate
    23-25 June 1998
  • Firstpage
    364
  • Lastpage
    369
  • Abstract
    Checkpointing and rollback recovery is a very effective technique to tolerate faults, provided the application is able to recover from a previous checkpoint and proceed with a failure-free computation. However, this technique may fall short if the checkpoint files are somehow contaminated by errors. This paper presents two mechanisms that may be used to determine if a committed checkpoint is error-free or not. These techniques can be used simultaneously for error detection and failure recovery. Both of them are based on checkpoint duplication: one makes use of spatial redundancy while the other is based on temporal redundancy. We discuss the main problems and trade-offs that have to be dealt with to implement these techniques. We then present a performance study that clearly shows the pros and cons of each one. As far as we know, this paper presents the first implementation of these mechanisms in a standard parallel computing system.
  • Keywords
    fault tolerant computing; parallel processing; redundancy; system recovery; checkpoint contamination avoidance; checkpoint duplication; checkpoint files; checkpointing; error detection; failure recovery; failure-free computation; parallel computing system; parallel systems; rollback recovery; spatial redundancy; temporal redundancy; Checkpointing; Contamination; Costs; Ear; Error correction; Fault detection; Fault tolerant systems; Hardware; Mechanical factors; Parallel processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
  • Conference_Location
    Munich, Germany
  • ISSN
    0731-3071
  • Print_ISBN
    0-8186-8470-4
  • Type
    conf
  • DOI
    10.1109/FTCS.1998.689487
  • Filename
    689487