• DocumentCode
    1392153
  • Title
    A New Diskless Checkpointing Approach for Multiple Processor Failures
  • Author
    Chiu, Ge-Ming ; Chiu, Jane-Ferng
  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
  • Volume
    8
  • Issue
    4
  • fYear
    2011
  • Firstpage
    481
  • Lastpage
    493
  • Abstract
    Diskless checkpointing is an important technique for performing fault tolerance in distributed or parallel computing systems. This study proposes a new approach to enhance neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it is a checkpoint storage node. This study defines the concept of safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these requirements. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
  • Keywords
    distributed processing; fault tolerant computing; program processors; storage management; XOR operations; checkpoint storage nodes; diskless checkpointing approach; distributed computing systems; failure recovery operations; fault tolerance; multiple processor failures; parallel computing systems; Arrays; Checkpointing; Encoding; Mobile computing; Parallel processing; Peer to peer computing; Reed-Solomon codes; Diskless checkpointing; XOR; multiple failures; rollback recovery
  • fLanguage
    English
  • Journal_Title
    Dependable and Secure Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5971
  • Type
    jour
  • DOI
    10.1109/TDSC.2010.76
  • Filename
    5654515
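
The abstract describes a storage node that keeps an XOR combination of peer checkpoints and recovers a failed processor in a single step. A minimal sketch of that generic XOR-parity idea follows. It is not the paper's scheme: the processor names, the checkpoint contents, and the helpers `xor_blocks` and `recover` are hypothetical, and it covers only one failure among the processors served by one storage node. The paper's design of checkpoint storage node sets for multiple failures is not reproduced here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical checkpoints of three processors (equal length for simplicity).
checkpoints = {
    "p0": b"\x01\x02\x03\x04",
    "p1": b"\x10\x20\x30\x40",
    "p2": b"\xaa\xbb\xcc\xdd",
}

# Assumption: one storage node serves all three processors and keeps only
# the XOR of their checkpoints, not the individual copies.
stored = xor_blocks(list(checkpoints.values()))


def recover(failed, stored, checkpoints):
    """Rebuild the checkpoint of `failed` from the stored XOR block and the
    checkpoints of the surviving processors."""
    survivors = [data for name, data in checkpoints.items() if name != failed]
    return xor_blocks([stored] + survivors)


recovered = recover("p1", stored, checkpoints)
```

Because XOR is its own inverse, XOR-ing the stored block with every surviving checkpoint cancels those checkpoints and leaves only the failed processor's data.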