Title :
Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off
Author :
Liu, Yudan ; Leangsuksun, Chokchai Box ; Song, Hertong ; Scott, Stephen L.
Author_Institution :
Comput. Sci., Louisiana Tech Univ., Ruston, LA
Abstract :
In recent years, large-scale clusters have been commonly deployed to solve important grand-challenge scientific problems. To reduce computation time, system sizes have been steadily expanded. Unfortunately, the reliability of such cluster systems decreases as the system scale grows. Since the failure of a single node can result in a system outage, it is essential to deal effectively with faults in the grand-challenge problem-solving environment. Checkpointing is one of the common fault tolerance techniques. However, checkpointing poses many challenges, such as overhead, latency, and consistency, as well as recovery. In this paper, we introduce a reliability-aware checkpoint/restart method, a novel technique that places checkpoints based on system reliability. We constructed a cost model and derived an optimal checkpoint placement function based on failure rates. A trade-off between performance and reliability (i.e., performability) was a key consideration. We also implemented a proof of concept and demonstrated the improvements resulting from our techniques for fault-tolerant MPI applications on an HA-OSCAR cluster.
Keywords :
checkpointing; fault tolerant computing; message passing; software reliability; workstation clusters; HA-OSCAR cluster; cost model; fault-tolerant MPI applications; optimal checkpoint placement function; performability trade-off; reliability-aware checkpoint; reliability-aware restart; system reliability; Availability; Checkpointing; Contracts; Fault tolerance; Large-scale systems; Message passing; Open source software; Redundancy; Resilience; Runtime environment; Cluster Computing; Fault Tolerance; Message Passing Interface; Reliability;
Conference_Title :
Cluster Computing, 2005. IEEE International
Conference_Location :
Burlington, MA
Print_ISBN :
0-7803-9486-0
Electronic_ISBN :
1552-5244
DOI :
10.1109/CLUSTR.2005.347058