DocumentCode
579922
Title
Checkpoint Selection in Fault Recovery Based on Byzantine Fault Model
Author
Xu, Xinhai ; Lin, Yufei
Author_Institution
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
fYear
2012
fDate
3-5 Nov. 2012
Firstpage
582
Lastpage
587
Abstract
As supercomputer performance grows, reliability becomes an increasingly serious problem. To complete an application with small fault recovery overhead, Checkpoint/Restart (C/R) methods are widely used. So far, the mainstream C/R methods either assume the Fail-Stop fault model or have the system (or program) perform error detection before storing checkpoints, so they can ensure the correctness of every checkpoint. However, faults in real-world systems are better described by the Byzantine fault model, and, in pursuit of higher practical performance, neither the system nor the program implements any fault detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be selected as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy and a statistical strategy. The simulation results show that the conservative strategy is superior when the error latent period is long, while the aggressive strategy shows the opposite behavior; the statistical strategy has stable efficiency, with only 50% more overhead than ideal checkpoint selection when the checkpoint period is half of the mean time between faults.
Keywords
system recovery; Byzantine fault model; C/R; Checkpoint/Restart methods; aggressive strategy; checkpoint selection; conservative strategy; error detection; fail stop fault model; fault recovery; reliability problem; statistical strategy; supercomputers; Accidents; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Probability density function; Transient analysis; Byzantine fault model; Checkpoint selection; Checkpoint/Restart; fault tolerance; statistics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence and Communication Networks (CICN), 2012 Fourth International Conference on
Conference_Location
Mathura
Print_ISBN
978-1-4673-2981-1
Type
conf
DOI
10.1109/CICN.2012.59
Filename
6375180