Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
As supercomputer performance grows, reliability becomes an increasingly serious problem. To complete an application with low fault-recovery overhead, Checkpoint/Restart (C/R) methods are widely used. So far, mainstream C/R methods either assume the Fail-Stop fault model or have the system (or program) perform error detection before storing checkpoints, so they can ensure that every checkpoint is correct. However, faults in real-world systems conform more closely to the Byzantine fault model, and in pursuit of higher practical performance, neither the system nor the program implements any fault detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be selected as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy, and a statistical strategy. Simulation results show that the conservative strategy is superior when the error latent period is long, while the aggressive strategy shows the opposite behavior. The statistical strategy has stable efficiency, incurring only 50% more overhead than ideal checkpoint selection when the checkpoint period is half the mean time between faults.
Keywords :
system recovery; Byzantine fault model; C/R; Checkpoint/Restart methods; aggressive strategy; checkpoint selection; conservative strategy; error detection; fail stop fault model; fault recovery; reliability problem; statistical strategy; supercomputers; Accidents; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Probability density function; Transient analysis; Byzantine fault model; Checkpoint selection; Checkpoint/Restart; fault tolerance; statistics;