DocumentCode
5570
Title
Complexity Analysis of Checkpoint Scheduling with Variable Costs
Author
Bouguerra, Mohamed-Slim ; Trystram, Denis ; Wagner, F.
Author_Institution
INRIA Rhone-Alpes Grenoble, Montbonnot St. Martin, France
Volume
62
Issue
6
fYear
2013
fDate
Jun-13
Firstpage
1269
Lastpage
1275
Abstract
The parallel computing platforms available today are increasingly larger and thus, more and more subject to failures. Consequently it is necessary to develop efficient strategies providing safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient technique for developing fault-tolerant applications on such context. However, checkpoint operations are costly in terms of time, computation, and network communication. This will certainly affect the global performance of the application. In this work, we propose a performance model that expresses formally the checkpoint scheduling problem. This model exhibits the tradeoff between the impact of the checkpoints operations and the lost computation due to failures. Based on this model, we study the computational complexity of the problem of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that explicits in depth the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
Keywords
checkpointing; computational complexity; dynamic programming; parallel processing; probability; scheduling; HPC parallel application; NP-hard problem; checkpoint cost; checkpoint operation; checkpoint scheduling; checkpointing time; complexity analysis; computational complexity; dynamic programming scheme; fault-tolerant application; high performance computing; parallel computing platform; probabilistic failure model; Checkpointing; Computational modeling; History; Maintenance engineering; Optimization; Processor scheduling; Program processors; Fault tolerance; checkpoint scheduling; failure detection;
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
ieee
ISSN
0018-9340
Type
jour
DOI
10.1109/TC.2012.57
Filename
6165262
Link To Document