DocumentCode
2397229
Title
What is Missing in Current Checkpoint Interval Models?
Author
Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution
Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma of Barcelona, Barcelona, Spain
fYear
2011
fDate
20-24 June 2011
Firstpage
322
Lastpage
332
Abstract
The growth in the number of components that compose parallel computers increases their fault frequency. In such systems, faults are no longer rare events but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is how to define the checkpoint interval. In this paper we propose modelling the relationship that message exchange establishes between a parallel application's processes, in order to incorporate this relationship into current checkpoint interval models. The experimental evaluation shows that our checkpoint interval model, based on the definition of the parallel application inter-process dependency factor, is effective for calculating the checkpoint interval of parallel applications. Our results demonstrate that the overhead prediction error is smaller than 4% when compared with the application's execution.
Keywords
checkpointing; parallel processing; software fault tolerance; checkpoint interval model; checkpointing; fault tolerance protocol; parallel application interprocess dependency factor; parallel computer fault frequency; Checkpointing; Computational modeling; Equations; Fault tolerance; Fault tolerant systems; Mathematical model; Protocols; checkpoint interval; fault tolerance; model; mpi; parallel applications;
fLanguage
English
Publisher
IEEE
Conference_Titel
Distributed Computing Systems (ICDCS), 2011 31st International Conference on
Conference_Location
Minneapolis, MN
ISSN
1063-6927
Print_ISBN
978-1-61284-384-1
Electronic_ISBN
1063-6927
Type
conf
DOI
10.1109/ICDCS.2011.12
Filename
5961713