DocumentCode :
1992381
Title :
Tolerating Temporal Correlated Failures from Cyclic Dependency in High Performance Computing Systems
Author :
Chen, Xin ; He, Xubin
Author_Institution :
Dept. of Electr. & Comput. Eng., Tennessee Technol. Univ., Cookeville, TN, USA
fYear :
2008
fDate :
8-10 Dec. 2008
Firstpage :
509
Lastpage :
516
Abstract :
Correlated failures have recently gained attention in research on failures in large-scale systems. Recent studies have pointed out the negative effect of ignoring such failures when designing a fault-tolerant scheme for large-scale systems. In this paper, we explore the behavior of temporal correlated failures arising from cyclic dependency among task nodes via an abstract model. Using this model, we find that fast failure propagation and slow recovery from failures are the two dominant factors that make recovering from such failures much more difficult. To efficiently stop failure propagation and shorten the total recovery time, we propose a recovery protocol called GCCTS (group-based coordinated checkpointing and task suspending) against temporal correlated failures.
Keywords :
distributed processing; fault tolerant computing; cyclic dependency; fault tolerant scheme; group-based coordinated checkpointing; high performance computing systems; large scale systems; recovery protocol; task suspending; temporal correlated failure tolerance; Application software; Availability; Checkpointing; Concurrent computing; Fault tolerant systems; Hardware; Helium; High performance computing; Large-scale systems; Protocols; Correlated failures; availability; checkpointing; dependency; failure propagation
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on
Conference_Location :
Melbourne, VIC
ISSN :
1521-9097
Print_ISBN :
978-0-7695-3434-3
Type :
conf
DOI :
10.1109/ICPADS.2008.24
Filename :
4724359