DocumentCode :
505976
Title :
Performance under failures of high-end computing
Author :
Wu, Ming ; Sun, Xian-He ; Jin, Hui
Author_Institution :
Illinois Institute of Technology, Chicago, Illinois
fYear :
2007
fDate :
10-16 Nov. 2007
Firstpage :
1
Lastpage :
11
Abstract :
Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of faults' influence on application performance is necessary to use existing fault tolerant methods wisely. In this study, we first introduce some practical and effective performance models to predict the application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures. Finally, extensive simulations and experiments are conducted to evaluate our prediction models and scheduling strategies with actual failure trace.
Keywords :
Application software; Checkpointing; Computational modeling; Costs; Fault tolerance; High performance computing; Large-scale systems; Predictive models; Production; Scheduling algorithm; application performance; failure modeling; fault-tolerance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Supercomputing, 2007. SC '07. Proceedings of the 2007 ACM/IEEE Conference on
Conference_Location :
Reno, NV, USA
Print_ISBN :
978-1-59593-764-3
Electronic_ISBN :
978-1-59593-764-3
Type :
conf
DOI :
10.1145/1362622.1362687
Filename :
5348807