Title :
Checkpoint/restart in practice: When ‘simple is better’
Author :
El-Sayed, Nosayba ; Schroeder, Bianca
Author_Institution :
Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada
Abstract :
Efficient use of high-performance computing (HPC) installations critically relies on effective methods for fault tolerance. The most commonly used method is checkpoint/restart, where an application periodically writes checkpoints of its state to stable storage, from which it can restart after a failure. Despite the prevalence of checkpoint/restart, it is still not well understood in practice how to set its key parameter, the checkpoint interval. Although there is a large body of theoretical work, practitioners still rely on crude rules of thumb such as “checkpoint once every hour”. Our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high-quality solutions. In particular, our paper makes the following contributions: We evaluate an array of methods for optimizing the checkpoint interval, both previously known ones and new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance. We also find that, on real-world traces, more complex solutions improve performance only negligibly. We show that simple back-of-the-envelope formulas can be used to accurately estimate the wasted work in HPC systems, and we make projections for future HPC systems and the requirements for their efficient use.
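The abstract does not say which closed-form solution or back-of-the-envelope formula it refers to. As an illustration only, the sketch below assumes the widely cited first-order approximation usually attributed to Young, interval = sqrt(2 * C * MTBF), together with a rough waste estimate that adds checkpoint overhead to expected rework and ignores restart cost. The function names and the input numbers (a 5-minute checkpoint cost and a 24-hour mean time between failures) are hypothetical and are not taken from the paper.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """First-order checkpoint interval, sqrt(2 * C * MTBF), in the input time unit.

    Assumption: this is Young's approximation, used here only as an example
    of a simple closed-form rule; the abstract does not name its formula.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval, checkpoint_cost, mtbf):
    """Rough fraction of time lost: checkpoint overhead plus expected rework.

    Assumption: a first-order estimate that ignores restart time and assumes
    a failure loses, on average, half an interval of work.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical inputs, both in seconds.
C = 300.0          # time to write one checkpoint (5 minutes)
M = 24 * 3600.0    # mean time between failures (24 hours)

tau = young_interval(C, M)
hourly = 3600.0    # the "checkpoint once every hour" rule of thumb

print(f"closed-form interval: {tau / 60:.1f} min, "
      f"estimated waste: {wasted_fraction(tau, C, M):.1%}")
print(f"hourly interval:      {hourly / 60:.1f} min, "
      f"estimated waste: {wasted_fraction(hourly, C, M):.1%}")
```

With these made-up inputs, the closed-form interval comes out at two hours, and the estimated waste is lower than with hourly checkpoints. This is the kind of comparison the abstract describes, although the paper's own evaluation relies on real failure logs and not on this simplified model.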
Keywords :
checkpointing; fault tolerant computing; parallel processing; system monitoring; HPC systems; back-of-the-envelope formulas; checkpoint interval; checkpoint-restart; checkpointing process; closed-form solution; failure logs; fault tolerance; high-performance computing; Closed-form solutions; Estimation; Optimized production technology; Parameter estimation; Software
Conference_Title :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid
DOI :
10.1109/CLUSTER.2014.6968777