• DocumentCode
    1640745
  • Title

    Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

  • Author

    Naksinehaboon, N. ; Yudan Liu ; Leangsuksun, C. ; Nassar, R. ; Paun, M. ; Scott, S.L.

  • Author_Institution
    Coll. of Eng. & Sci., Louisiana Tech Univ., Ruston, LA
  • fYear
    2008
  • Firstpage
    783
  • Lastpage
    788
  • Abstract
    For a full checkpoint on a large-scale HPC system, huge memory contexts may have to be transferred through the network and saved to reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Therefore, incremental checkpointing, a less intrusive method that reduces wasted time, has been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model (Liu et al., 2007) on the same failure data set show that the total wasted time in the incremental checkpoint model is significantly smaller than in the full checkpoint model.
  • Keywords
    checkpointing; large-scale systems; software reliability; incremental checkpoint; incremental restart; large-scale HPC system; reliable storage; system checkpoint; system reliability; Checkpointing; Computer networks; Context modeling; Educational institutions; Grid computing; Large-scale systems; Mathematical model; Mathematics; Scheduling; USA Councils;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-0-7695-3156-4
  • Type

    conf

  • DOI
    10.1109/CCGRID.2008.109
  • Filename
    4534304