DocumentCode :
1236762
Title :
Adaptive Fault Management of Parallel Applications for High-Performance Computing
Author :
Lan, Zhiling ; Li, Yawei
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
Volume :
57
Issue :
12
fYear :
2008
Firstpage :
1647
Lastpage :
1660
Abstract :
As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, in terms of reducing application completion times and improving resource utilization, by up to 43 percent.
Keywords :
checkpointing; parallel machines; performance evaluation; adaptive fault management; high-performance computing; periodic checkpointing; preventive migration; proactive migration; reactive checkpointing; resource utilization; stochastic modeling; Application software; Checkpointing; Computer Society; Computer applications; Concurrent computing; Power engineering computing; Resilience; Resource management; Runtime; Stochastic processes; Fault tolerance; Performance evaluation of algorithms and systems
fLanguage :
English
Journal_Title :
IEEE Transactions on Computers
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2008.90
Filename :
4531733