Regional Information Center for Science and Technology (RICeST) - Checkpoint/restart in practice: When ‘simple is better’

DocumentCode :

166694

Title :

Checkpoint/restart in practice: When ‘simple is better’

Author :

El-Sayed, Nosayba ; Schroeder, Bianca

Author_Institution :

Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada

fYear :

2014

fDate :

22-26 Sept. 2014

Firstpage :

Lastpage :

Abstract :

Efficient use of high-performance computing (HPC) installations critically relies on effective methods for fault tolerance. The most commonly used method is checkpoint/restart, where an application periodically writes checkpoints of its state to stable storage, from which it can restart in the case of a failure. Despite the prevalence of checkpoint/restart, it is still not well understood in practice how to set its key parameter, the checkpoint interval. Despite a large body of theoretical work, practitioners still rely on crude rules of thumb such as “checkpoint once every hour”. Our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high-quality solutions. In particular, our paper makes the following contributions: We evaluate an array of methods for optimizing the checkpoint interval, some previously known as well as new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance. We also find that more complex solutions only negligibly improve performance based on real-world traces. We show that simple back-of-the-envelope formulas can be used to accurately estimate the wasted work in HPC systems, and make projections of future HPC systems and requirements for their efficient use.
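The abstract does not say which closed-form solution is meant. As an illustration only, the sketch below uses Young's classic first-order approximation, in which the checkpoint interval is the square root of twice the checkpoint cost times the mean time between failures (MTBF), together with a matching back-of-the-envelope estimate of the wasted fraction of time. The function names, the waste formula's simplifications (restart cost ignored), and the example numbers are assumptions for illustration, not values or methods taken from the paper.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF).

    Both arguments must use the same time unit (seconds here).
    Assumption: this is one well-known closed form, not necessarily
    the exact one evaluated in the paper.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time lost.

    C / interval is the overhead of writing checkpoints;
    interval / (2 * MTBF) is the expected work redone after a failure
    (on average half an interval). Restart cost is ignored.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    C = 300.0      # hypothetical checkpoint cost: 5 minutes
    M = 86400.0    # hypothetical MTBF: 24 hours
    tau = young_interval(C, M)
    print(f"interval from Young's formula: {tau:.0f} s")
    print(f"estimated waste at that interval: {wasted_fraction(tau, C, M):.4f}")
    print(f"estimated waste at one hour:      {wasted_fraction(3600.0, C, M):.4f}")
```

Under these hypothetical inputs, the formula suggests a longer interval than the "once every hour" rule of thumb, and the first-order waste estimate is lower at that interval than at one hour. The estimate is only reasonable when the interval is small relative to the MTBF.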

Keywords :

checkpointing; fault tolerant computing; parallel processing; system monitoring; HPC systems; back-of-the envelope formulas; checkpoint interval; checkpoint-restart; checkpointing process; closed-form solution; failure logs; fault tolerance; high-performance computing; Checkpointing; Closed-form solutions; Estimation; Fault tolerance; Optimized production technology; Parameter estimation; Software; Checkpoint-restart; Fault tolerance; High-performance computing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing (CLUSTER), 2014 IEEE International Conference on

Conference_Location :

Madrid

Type :

conf

DOI :

10.1109/CLUSTER.2014.6968777

Filename :

6968777

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=166694