DocumentCode :
2962988
Title :
Reliability Speedup: An Effective Metric for Parallel Application with Checkpointing
Author :
Wang, Zhiyuan
Author_Institution :
Nat. Lab. for Paralleling & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2009
fDate :
8-11 Dec. 2009
Firstpage :
247
Lastpage :
254
Abstract :
As parallel computing systems scale up, system reliability decreases drastically, so parallel applications running on such systems must tolerate hardware failures. Checkpointing, which periodically saves the state of the computation to stable storage, is widely used in large-scale parallel computing. It incurs non-negligible fault tolerance overhead. The traditional speedup metric only measures the performance of a failure-free system. In this paper, we first propose a speedup metric that takes checkpointing overhead into account. The new metric unifies performance and reliability measures and evaluates the practical speedup of a parallel application with checkpointing. Furthermore, this paper classifies and analyzes existing parallel systems according to the proposed speedup metric and makes suggestions for improving system design and fault tolerance techniques. Finally, we validate the analysis of the new speedup metric experimentally. The experimental results indicate that the proposed speedup is an effective metric for parallel applications with checkpointing.
Keywords :
checkpointing; fault tolerant computing; parallel processing; performance evaluation; checkpointing overhead; failure-free system performance; fault tolerance overhead; large-scale parallel computing system; speedup metric; system design; system reliability; Checkpointing; Concurrent computing; Fault tolerance; Fault tolerant systems; Hardware; Large-scale systems; Parallel processing; Reliability; System analysis and design; Velocity measurement;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on
Conference_Location :
Higashi Hiroshima
Print_ISBN :
978-0-7695-3914-0
Type :
conf
DOI :
10.1109/PDCAT.2009.19
Filename :
5372794