• DocumentCode
    2397229
  • Title
    What is Missing in Current Checkpoint Interval Models?
  • Author
    Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
  • Author_Institution
    Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma of Barcelona, Barcelona, Spain
  • fYear
    2011
  • fDate
    20-24 June 2011
  • Firstpage
    322
  • Lastpage
    332
  • Abstract
    The growth in the number of components that compose parallel computers increases their fault frequency. In such systems, faults are no longer a rare event but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is the definition of the checkpoint interval. In this paper we propose modelling the relationship established between a parallel application's processes through message exchange, in order to incorporate this relationship into current checkpoint interval models. The experimental evaluation shows that our checkpoint interval model, based on the definition of the parallel application inter-process dependency factor, is effective for calculating the checkpoint interval of parallel applications. Our results demonstrate that the overhead prediction error is smaller than 4% in comparison with the application execution.
  • Keywords
    checkpointing; parallel processing; software fault tolerance; checkpoint interval model; fault tolerance protocol; parallel application interprocess dependency factor; parallel computer fault frequency; Checkpointing; Computational modeling; Equations; Fault tolerance; Fault tolerant systems; Mathematical model; Protocols; checkpoint interval; fault tolerance; model; MPI; parallel applications
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems (ICDCS), 2011 31st International Conference on
  • Conference_Location
    Minneapolis, MN
  • ISSN
    1063-6927
  • Print_ISBN
    978-1-61284-384-1
  • Electronic_ISBN
    1063-6927
  • Type
    conf
  • DOI
    10.1109/ICDCS.2011.12
  • Filename
    5961713