DocumentCode :
1392153
Title :
A New Diskless Checkpointing Approach for Multiple Processor Failures
Author :
Chiu, Ge-Ming ; Chiu, Jane-Ferng
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
Volume :
8
Issue :
4
fYear :
2011
Firstpage :
481
Lastpage :
493
Abstract :
Diskless checkpointing is an important technique for providing fault tolerance in distributed or parallel computing systems. This study proposes a new approach to enhance neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it is a checkpoint storage node. This study defines the concept of a safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these requirements. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
Keywords :
distributed processing; fault tolerant computing; program processors; storage management; XOR operations; checkpoint storage nodes; diskless checkpointing approach; distributed computing systems; failure recovery operations; fault tolerance; multiple processor failures; parallel computing systems; Arrays; Checkpointing; Encoding; Mobile computing; Parallel processing; Peer to peer computing; Reed-Solomon codes; Diskless checkpointing; XOR; multiple failures; rollback recovery;
fLanguage :
English
Journal_Title :
Dependable and Secure Computing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1545-5971
Type :
jour
DOI :
10.1109/TDSC.2010.76
Filename :
5654515