Title :
Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation
Author :
Bressoud, Thomas C. ; Kozuch, Michael A.
Author_Institution :
Dept. of Math. & Comput. Sci., Denison Univ., Granville, OH, USA
Date :
Aug. 31 2009-Sept. 4 2009
Abstract :
Traditionally, cluster computing has employed checkpointing to address fault tolerance. Recently, new models for parallel applications, namely MapReduce and Dryad, have grown in popularity. Their runtime systems provide their own re-execute-based fault tolerance mechanisms, but the failure characteristics of these mechanisms have not been analyzed. Another development is the availability of failure data spanning years for systems of significant size at Los Alamos National Labs (LANL). The time between failures (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which brings those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation allows us to evaluate the fault tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program on a cluster in the presence of faults under both fault tolerance models.
Keywords :
checkpointing; discrete event simulation; exponential distribution; failure analysis; fault tolerant computing; optimisation; Dryad application; LANL data; Los Alamos National Labs; MapReduce application; cluster computing; cluster fault tolerance; expected running time minimisation; experimental evaluation; failure data availability; parallel checkpointing; re-execute based fault tolerance mechanism; runtime system; time between failure; Application software; Computational modeling; Computer science; Computer simulation; Fault tolerance; Fault tolerant systems; Mathematics; Parallel programming
Conference_Title :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4244-5011-4
ISSN :
1552-5244
DOI :
10.1109/CLUSTR.2009.5289185