DocumentCode :
625574
Title :
Adaptive Incremental Checkpointing via Delta Compression for Networked Multicore Systems
Author :
Jangjaimon, Itthichok ; Tzeng, Nian-Feng
Author_Institution :
Center for Adv. Comput. Studies, Univ. of Louisiana at Lafayette, Lafayette, LA, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
7
Lastpage :
18
Abstract :
Checkpointing has been widely adopted in support of fault tolerance and job migration, with checkpoint files preferably also kept at remote storage to withstand unavailability or failures of local nodes in networked systems. Lately, I/O bandwidth to remote storage has become the bottleneck for checkpointing on large-scale systems. This paper proposes adaptive incremental checkpointing (AIC), which aims to reduce the checkpoint file size considerably so that the associated overhead is lowered and the expected job turnaround time drops. Given that production multicore systems are often observed to have unused cores available, we design AIC to use separate cores for carrying out multi-level checkpointing with delta compression adaptively at desirable points in time. We develop a new Markov model for predicting the performance of such multi-level concurrent checkpointing, and evaluate AIC performance using six SPEC benchmarks under various system sizes. AIC is observed to lower the normalized expected turnaround time substantially (by up to 47%) when compared to its static counterpart and to a recent multi-level checkpointing scheme with fixed checkpoint intervals.
Keywords :
Markov processes; checkpointing; data compression; fault tolerant computing; multiprocessing systems; AIC performance; IO bandwidth; Markov model; SPEC benchmarks; adaptive incremental checkpointing; delta compression; fault-tolerance; fixed checkpoint intervals; job migration; large-scale system; multilevel checkpointing scheme; networked multicore systems; production multicore systems; remote storage; Bandwidth; Benchmark testing; Checkpointing; Markov processes; Multicore processing; Numerical models; Runtime; Adaptive checkpointing; Markov model; delta compression; fault tolerance; incremental checkpointing; multicore systems; two-level checkpointing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
Conference_Location :
Boston, MA
ISSN :
1530-2075
Print_ISBN :
978-1-4673-6066-1
Type :
conf
DOI :
10.1109/IPDPS.2013.33
Filename :
6569796