Title :
Tolerating Temporal Correlated Failures from Cyclic Dependency in High Performance Computing Systems
Author :
Chen, Xin ; He, Xubin
Author_Institution :
Dept. of Electr. & Comput. Eng., Tennessee Technol. Univ., Cookeville, TN, USA
Abstract :
Correlated failures have recently gained attention in research on failures in large-scale systems. Recent studies have pointed out the negative effect of ignoring such failures when designing a fault-tolerant scheme for large-scale systems. In this paper, we explore the behavior of temporal correlated failures arising from cyclic dependency among task nodes via an abstract model. Using this model, we find that fast failure propagation and slow recovery from failures are the two dominant factors that make recovering from such failures much more difficult. To efficiently stop failure propagation and shorten the total recovery time, we propose a recovery protocol called GCCTS (group-based coordinated checkpointing and task suspending) against temporal correlated failures.
Keywords :
distributed processing; fault tolerant computing; cyclic dependency; fault tolerant scheme; group-based coordinated checkpointing; high performance computing systems; large scale systems; recovery protocol; task suspending; temporal correlated failure tolerance; Application software; Availability; Checkpointing; Concurrent computing; Fault tolerant systems; Hardware; Helium; High performance computing; Large-scale systems; Protocols; Correlated failures; availability; checkpointing; dependency; failure propagation
Conference_Title :
Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on
Conference_Location :
Melbourne, VIC
Print_ISBN :
978-0-7695-3434-3
DOI :
10.1109/ICPADS.2008.24