DocumentCode :
560171
Title :
Checkpointing strategies for parallel jobs
Author :
Bougeret, Marin ; Casanova, Henri ; Rabie, Mikael ; Robert, Yves ; Vivien, Frédéric
Author_Institution :
ENS Lyon, Lyon, France
fYear :
2011
fDate :
12-18 Nov. 2011
Firstpage :
1
Lastpage :
11
Abstract :
This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems. The obtained results not only corroborate our theoretical findings, but also show that our dynamic programming algorithm significantly outperforms previously proposed solutions in the case of Weibull failures. We then discuss results from simulation experiments that use failure logs from production clusters. These results confirm that our dynamic programming algorithm significantly outperforms existing solutions for real-world clusters.
Keywords :
Weibull distribution; checkpointing; dynamic programming; exponential distribution; parallel processing; software fault tolerance; Weibull distributions; checkpointing strategies; dynamic programming algorithm; expected job execution time minimization; exponential distributions; exponentially distributed failure inter-arrival times; job parallelism; nonexponentially distributed failures; parallel checkpointing overhead; parallel jobs; processor failures; production clusters; sequential jobs; Checkpointing; Dynamic programming; Exponential distribution; Heuristic algorithms; Programming; Shape; Weibull distribution; Fault-tolerance; checkpointing; parallel job; sequential job;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seatle, WA
Electronic_ISBN :
978-1-4503-0771-0
Type :
conf
Filename :
6114437
Link To Document :
بازگشت