Title :
Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart
Author :
Ziming Zheng ; Li Yu ; Zhiling Lan
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
Abstract :
Speedup models are powerful analytical tools for evaluating and predicting the performance of parallel applications. Unfortunately, well-known speedup models such as Amdahl's law and Gustafson's law do not take reliability into consideration and therefore cannot accurately account for application performance in the presence of failures. In this study, we enhance Amdahl's law and Gustafson's law by considering the impact of failures and the effect of coordinated checkpointing/restart. Unlike existing analytical studies that rely on the Exponential failure distribution alone, in this work we consider both Exponential and Weibull failure distributions in the construction of our reliability-aware speedup models. The derived reliability-aware models are validated through trace-based simulations under a variety of parameter settings. Our trace-based simulations demonstrate that these models can effectively quantify the impact of failures on application speedup. Moreover, we present two case studies to illustrate the use of these reliability-aware speedup models.
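As a rough illustration of what "reliability-aware" speedup means, the Python sketch below multiplies the classic Amdahl and Gustafson speedups by a checkpoint/restart efficiency factor. This is not the model derived in the paper. It is a generic first-order approximation that assumes Exponential failures only (the paper also treats Weibull), independent node failures so that system MTBF is node MTBF divided by n, Young's checkpoint period sqrt(2*C*M), and roughly half a period of lost work per failure. The helper names (checkpoint_efficiency, reliability_aware_speedup) and the parameter values in the usage block are hypothetical.

```python
import math


def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's law: fixed-size workload, parallel fraction f, n processors."""
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f: float, n: int) -> float:
    """Gustafson's law: scaled workload, parallel fraction f of the n-processor run."""
    return (1.0 - f) + f * n


def young_period(ckpt_cost: float, mtbf: float) -> float:
    """Young's first-order optimal checkpoint period sqrt(2*C*M), exponential failures."""
    if ckpt_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def checkpoint_efficiency(ckpt_cost: float, restart_cost: float, mtbf: float) -> float:
    """Hypothetical first-order fraction of wall-clock time spent on useful work.

    Failure-free waste is C/T (one checkpoint per period T). Failure waste is
    (R + T/2)/M: a restart plus, on average, half a period of lost work per failure.
    """
    period = young_period(ckpt_cost, mtbf)
    # Clamp each factor: the first-order model breaks down when C or R approaches M.
    failure_free = max(0.0, 1.0 - ckpt_cost / period)
    with_failures = max(0.0, 1.0 - (restart_cost + period / 2.0) / mtbf)
    return failure_free * with_failures


def reliability_aware_speedup(f: float, n: int, node_mtbf: float,
                              ckpt_cost: float, restart_cost: float,
                              scaled: bool = False) -> float:
    """Hypothetical helper: classic speedup scaled by checkpoint/restart efficiency.

    Assumes n identical nodes with independent exponential failures, so the
    system MTBF is node_mtbf / n. Speedup is relative to a failure-free serial run.
    """
    base = gustafson_speedup(f, n) if scaled else amdahl_speedup(f, n)
    system_mtbf = node_mtbf / n
    return base * checkpoint_efficiency(ckpt_cost, restart_cost, system_mtbf)


if __name__ == "__main__":
    # Illustrative parameters only (not from the paper): 5-year node MTBF,
    # 10-minute checkpoint and restart costs, parallel fraction 0.9999.
    node_mtbf = 5 * 365 * 24 * 3600.0
    for n in (1_000, 10_000, 100_000):
        s = reliability_aware_speedup(0.9999, n, node_mtbf,
                                      ckpt_cost=600.0, restart_cost=600.0)
        print(n, round(amdahl_speedup(0.9999, n), 1), round(s, 1))
```

Because system MTBF shrinks as n grows, the efficiency factor falls with scale, so the adjusted speedup can flatten or drop even where Amdahl's law still predicts gains. A Weibull treatment would need a different expected-loss term and is outside this sketch.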
Keywords :
Weibull distribution; checkpointing; exponential distribution; parallel processing; Weibull failure distributions; coordinated checkpointing/restart; exponential failure distribution; parallel applications; reliability-aware speedup models; trace-based simulations; Analytical models; Checkpointing; Computational modeling; Exponential distribution; Mathematical model; Reliability; Weibull distribution; Amdahl's law; Gustafson's law; Speedup; analytical modeling; reliability;
Journal_Title :
Computers, IEEE Transactions on
DOI :
10.1109/TC.2014.2317182