DocumentCode :
579922
Title :
Checkpoint Selection in Fault Recovery Based on Byzantine Fault Model
Author :
Xu, Xinhai ; Lin, Yufei
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2012
fDate :
3-5 Nov. 2012
Firstpage :
582
Lastpage :
587
Abstract :
As supercomputer performance grows, reliability is becoming an increasingly serious problem. Checkpoint/Restart (C/R) methods are widely used to complete an application with low fault recovery overhead. Mainstream C/R methods either assume the Fail-Stop fault model or have the system (or program) perform error detection before storing checkpoints, so they can guarantee the correctness of every checkpoint. However, faults in real-world systems conform more closely to the Byzantine fault model, and, in pursuit of higher practical performance, neither the system nor the program implements any fault detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be chosen as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy, and a statistical strategy. Simulation results show that the conservative strategy is superior when the error latent period is long, while the aggressive strategy shows the opposite behavior; the statistical strategy has stable efficiency, incurring only 50% more overhead than ideal checkpoint selection when the checkpoint period is half the mean time between faults.
Keywords :
system recovery; Byzantine fault model; C/R; Checkpoint/Restart methods; aggressive strategy; checkpoint selection; conservative strategy; error detection; fail stop fault model; fault recovery; reliability problem; statistical strategy; supercomputers; Accidents; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Probability density function; Transient analysis; Byzantine fault model; Checkpoint selection; Checkpoint/Restart; fault tolerance; statistics;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Communication Networks (CICN), 2012 Fourth International Conference on
Conference_Location :
Mathura
Print_ISBN :
978-1-4673-2981-1
Type :
conf
DOI :
10.1109/CICN.2012.59
Filename :
6375180