Title :
A New Diskless Checkpointing Approach for Multiple Processor Failures
Author :
Chiu, Ge-Ming ; Chiu, Jane-Ferng
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
Abstract :
Diskless checkpointing is an important technique for providing fault tolerance in distributed or parallel computing systems. This study proposes a new approach that enhances neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it is a checkpoint storage node. This study defines a safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these requirements. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
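The XOR-based storage and single-step recovery described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or its storage-node design: the function names (`build_parity`, `recover`), the four-processor layout, and the assumption that each surviving processor still holds a local copy of its own checkpoint are all hypothetical choices made for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_parity(checkpoints, storage_sets):
    """Each node keeps the XOR of the checkpoints of the processors it serves.

    checkpoints:  {processor: bytes}, all of equal length (assumed)
    storage_sets: {processor: tuple of its checkpoint storage nodes}
    """
    size = len(next(iter(checkpoints.values())))
    parity = {p: bytes(size) for p in checkpoints}
    for proc, nodes in storage_sets.items():
        for node in nodes:
            parity[node] = xor_bytes(parity[node], checkpoints[proc])
    return parity


def recover(failed, lost, checkpoints, parity, storage_sets):
    """Try to rebuild one failed processor's checkpoint in a single step.

    A storage node is usable only if it survived and every other processor
    whose checkpoint is folded into its XOR data also survived (assumption:
    survivors can supply their own local checkpoints). Returns None when no
    usable node exists.
    """
    for node in storage_sets[failed]:
        if node in lost:
            continue
        others = [p for p, nodes in storage_sets.items()
                  if node in nodes and p != failed]
        if any(p in lost for p in others):
            continue
        result = parity[node]
        for p in others:
            # XOR out the surviving checkpoints, leaving the failed one.
            result = xor_bytes(result, checkpoints[p])
        return result
    return None


# Hypothetical layout: processor i stores its checkpoint on both ring neighbors.
checkpoints = {i: bytes([i + 1] * 4) for i in range(4)}
storage_sets = {i: ((i + 1) % 4, (i + 3) % 4) for i in range(4)}
parity = build_parity(checkpoints, storage_sets)
```

With this naive ring layout, processor 0 can be rebuilt when processors 0 and 1 fail together, but not when processors 0 and 2 fail together, since both of its storage nodes then hold XOR data that also involves the lost checkpoint of processor 2. That gap is the kind of situation the safe recovery criterion and the storage node set design are meant to rule out.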
Keywords :
distributed processing; fault tolerant computing; program processors; storage management; XOR operations; checkpoint storage nodes; diskless checkpointing approach; distributed computing systems; failure recovery operations; fault tolerance; multiple processor failures; parallel computing systems; Arrays; Checkpointing; Encoding; Mobile computing; Parallel processing; Peer to peer computing; Reed-Solomon codes; Diskless checkpointing; XOR; multiple failures; rollback recovery
Journal_Title :
Dependable and Secure Computing, IEEE Transactions on
DOI :
10.1109/TDSC.2010.76