DocumentCode :
2888637
Title :
A reliability-aware approach for an optimal checkpoint/restart model in HPC environments
Author :
Liu, Yudan ; Nassar, Raja ; Leangsuksun, Chockchai Box ; Naksinehaboon, Nichamon ; Paun, Mihaela ; Scott, Stephen
Author_Institution :
Louisiana Tech Univ., Ruston, LA
fYear :
2007
fDate :
17-20 Sept. 2007
Firstpage :
452
Lastpage :
457
Abstract :
The increasing physical size of high performance computing (HPC) platforms makes system reliability more challenging. To minimize the performance loss caused by unexpected failures or by the unnecessary overhead of fault-tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads. Our scheme addresses the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can handle only Poisson failures and a constant checkpoint interval, our model supports varying checkpoint intervals and different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
Keywords :
checkpointing; parallel processing; software fault tolerance; statistical distributions; Poisson failure; fault tolerant mechanism; high performance computing; large-scale HPC system; optimal checkpoint-restart strategy; rollback minimisation; statistical distribution; system reliability; Computer science; Cost function; Data analysis; Educational institutions; Failure analysis; Fault tolerant systems; Large-scale systems; Mathematical model; Mathematics; Reliability engineering;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing, 2007 IEEE International Conference on
Conference_Location :
Austin, TX
ISSN :
1552-5244
Print_ISBN :
978-1-4244-1387-4
Electronic_ISBN :
1552-5244
Type :
conf
DOI :
10.1109/CLUSTR.2007.4629264
Filename :
4629264