• DocumentCode
    5570
  • Title

    Complexity Analysis of Checkpoint Scheduling with Variable Costs

  • Author

    Bouguerra, Mohamed-Slim ; Trystram, Denis ; Wagner, F.

  • Author_Institution
    INRIA Rhone-Alpes Grenoble, Montbonnot St. Martin, France
  • Volume
    62
  • Issue
    6
  • fYear
    2013
  • fDate
    Jun-13
  • Firstpage
    1269
  • Lastpage
    1275
  • Abstract
    The parallel computing platforms available today are increasingly larger and thus, more and more subject to failures. Consequently it is necessary to develop efficient strategies providing safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient technique for developing fault-tolerant applications on such context. However, checkpoint operations are costly in terms of time, computation, and network communication. This will certainly affect the global performance of the application. In this work, we propose a performance model that expresses formally the checkpoint scheduling problem. This model exhibits the tradeoff between the impact of the checkpoints operations and the lost computation due to failures. Based on this model, we study the computational complexity of the problem of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that explicits in depth the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
  • Keywords
    checkpointing; computational complexity; dynamic programming; parallel processing; probability; scheduling; HPC parallel application; NP-hard problem; checkpoint cost; checkpoint operation; checkpoint scheduling; checkpointing time; complexity analysis; computational complexity; dynamic programming scheme; fault-tolerant application; high performance computing; parallel computing platform; probabilistic failure model; Checkpointing; Computational modeling; History; Maintenance engineering; Optimization; Processor scheduling; Program processors; Fault tolerance; checkpoint scheduling; failure detection;
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.2012.57
  • Filename
    6165262