Title :
Skewed checkpointing for tolerating multi-node failures
Author :
Nakamura, Hiroshi ; Hayashida, Takuro ; Kondo, Masaaki ; Tajima, Yuya ; Imai, Masashi ; Nanya, Takashi
Author_Institution :
Res. Center for Adv. Sci. & Technol., Tokyo Univ., Japan
Abstract :
Large cluster systems have become widely utilized because they achieve a good performance/cost ratio especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpointing itself contains only one degree of redundancy, this skewed checkpointing ensures ⌊log2 N⌋ degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
Keywords :
checkpointing; distributed memory systems; fault tolerant computing; redundancy; workstation clusters; computing node; coordinated checkpointing; distributed memory system; high performance computing; large cluster system; multinode failure tolerance; skewed checkpointing; Availability; Checkpointing; Costs; Degradation; Delay; Distributed computing; High performance computing; Performance analysis; Redundancy; Reliability engineering
Conference_Title :
Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on
Print_ISBN :
0-7695-2239-4
DOI :
10.1109/RELDIS.2004.1353012