DocumentCode :
1829326
Title :
Double Mutual-Aid Checkpointing for Fast Recovery
Author :
Chiu, Jane-Ferng
Author_Institution :
Dept. of Inf. Technol. & Commun., Tungnan Univ., New Taipei, Taiwan
fYear :
2012
fDate :
25-27 June 2012
Firstpage :
1015
Lastpage :
1020
Abstract :
As system sizes grow and the number of processors increases, errors and multiple simultaneous failures are becoming the norm rather than the exception. Tolerating multiple failures is therefore indispensable. Most diskless checkpointing schemes incur the maximum recovery overhead no matter how many failures happen at the same time. However, failures of a small number of processors happen more frequently than the worst case. This study resolves the dilemma between greater fault tolerance and fast recovery by presenting a novel diskless checkpointing scheme that makes use of double mutual-aid checkpoints. It not only gives the necessary and sufficient condition but also proposes a method for determining the setting of the double mutual-aid checkpoints.
Keywords :
checkpointing; errors; fault tolerant computing; probability; diskless checkpointing; double mutual-aid checkpointing; error probability; fast recovery; maximum recovery overhead; multiple failure tolerance; necessary and sufficient condition; processor failures; system size enlargement; Checkpointing; Encoding; Lead; Memory management; Program processors; Reliability; Sufficient conditions; diskless checkpointing; fast recovery; tolerate multiple failures;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on
Conference_Location :
Liverpool
Print_ISBN :
978-1-4673-2164-8
Type :
conf
DOI :
10.1109/HPCC.2012.148
Filename :
6332284