Title :
Reliability Speedup: An Effective Metric for Parallel Application with Checkpointing
Author_Institution :
Nat. Lab. for Paralleling & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
As parallel computing systems scale up, system reliability decreases drastically, so parallel applications running on such systems must tolerate hardware failures. Checkpointing, which periodically saves the state of the computation to stable storage, is widely used in large-scale parallel computing. It incurs non-negligible fault tolerance overhead. The traditional speedup metric only measures the performance of a failure-free system. In this paper, we first propose a speedup metric that takes checkpointing overhead into account. The new metric unifies the performance and reliability measures and evaluates the practical speedup of a parallel application with checkpointing. Furthermore, this paper classifies and analyzes existing parallel systems according to the proposed speedup metric and makes suggestions for improving system design and fault tolerance techniques. Finally, we validate the analysis of the new speedup metric by experiment. The experimental results indicate that the proposed speedup is an effective metric for parallel applications with checkpointing.
Keywords :
checkpointing; fault tolerant computing; parallel processing; performance evaluation; checkpointing overhead; failure-free system performance; fault tolerance overhead; large-scale parallel computing system; speedup metric; system design; system reliability; Checkpointing; Concurrent computing; Fault tolerance; Fault tolerant systems; Hardware; Large-scale systems; Parallel processing; Reliability; System analysis and design; Velocity measurement;
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on
Conference_Location :
Higashi Hiroshima
Print_ISBN :
978-0-7695-3914-0
DOI :
10.1109/PDCAT.2009.19