DocumentCode :
3090125
Title :
Reducing Application-level Checkpoint File Sizes: Towards Scalable Fault Tolerance Solutions
Author :
Cores, Iván ; Rodríguez, Gabriel ; Martín, María J. ; González, P.
Author_Institution :
Comput. Archit. Group, Univ. of A Coruña, A Coruña, Spain
fYear :
2012
fDate :
10-13 July 2012
Firstpage :
371
Lastpage :
378
Abstract :
Systems intended for the execution of long-running parallel applications require fault-tolerant capabilities, since the probability of failure increases with the execution time and the number of nodes. Checkpointing and rollback recovery is one of the most popular techniques to provide fault tolerance support. However, in order to be useful for large-scale systems, current checkpoint-recovery techniques should tackle the problem of reducing checkpointing cost. This paper addresses this issue through the reduction of checkpoint file sizes. Different solutions to reduce the size of the checkpoints generated at application level are proposed and implemented in a checkpointing tool. Detailed experimental results on two multicore clusters show the effectiveness of the proposed methods.
Keywords :
checkpointing; multiprocessing programs; parallel programming; probability; software fault tolerance; application-level checkpoint file sizes; failure probability; fault tolerance solutions; multicore clusters; parallel applications; Arrays; Checkpointing; Fault tolerance; Fault tolerant systems; Libraries; Multicore processing; Optimization; Checkpointing; Fault Tolerance; MPI; Parallel Programming;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on
Conference_Location :
Leganes
Print_ISBN :
978-1-4673-1631-6
Type :
conf
DOI :
10.1109/ISPA.2012.55
Filename :
6280315