Regional Information Center for Science and Technology - Performance under failures of high-end computing

DocumentCode :

505976

Title :

Performance under failures of high-end computing

Author :

Wu, Ming ; Sun, Xian-He ; Jin, Hui

Author_Institution :

Illinois Institute of Technology, Chicago, Illinois

fYear :

2007

fDate :

10-16 Nov. 2007

Firstpage :

Lastpage :

Abstract :

Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of faults' influence on application performance is necessary to use existing fault tolerant methods wisely. In this study, we first introduce some practical and effective performance models to predict the application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures. Finally, extensive simulations and experiments are conducted to evaluate our prediction models and scheduling strategies with actual failure trace.

Keywords :

Application software; Checkpointing; Computational modeling; Costs; Fault tolerance; High performance computing; Large-scale systems; Predictive models; Production; Scheduling algorithm; application performance; failure modeling; fault-tolerance

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Supercomputing, 2007. SC '07. Proceedings of the 2007 ACM/IEEE Conference on

Conference_Location :

Reno, NV, USA

Print_ISBN :

978-1-59593-764-3

Electronic_ISBN :

978-1-59593-764-3

Type :

conf

DOI :

10.1145/1362622.1362687

Filename :

5348807

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=505976