Regional Information Center for Science and Technology - Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

DocumentCode :

1191256

Title :

Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

Author :

Chen, Zizhong ; Dongarra, Jack

Author_Institution :

Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA

Volume :

Issue :

fYear :

2009

Firstpage :

1512

Lastpage :

1524

Abstract :

As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2\lceil \log p \rceil \cdot k((\beta + 2\gamma)m + \alpha)$ to $(1 + O(\sqrt{p}/\sqrt{m}))^2 \cdot k(\beta + 2\gamma)m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate to perform calculations, and $m$ is the size of the local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to $(1 + O(1/\sqrt{m})) \cdot k(\beta + 2\gamma)m$, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our self-healing scheme can survive multiple simultaneous process failures with low performance overhead and little numerical impact.
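
The weighted checksum idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`encode`, `recover_two`), the choice of weights (a row of ones and a row 1, 2, 3, ...), and the restriction to k = 2 failures are assumptions made for illustration, and the processes are simulated as plain Python lists rather than FT-MPI ranks.

```python
# Minimal sketch of weighted-checksum diskless checkpointing for k = 2.
# All names and the weight choice are hypothetical, not taken from the paper.

def encode(local_data, weights):
    """Compute one checksum vector per weight row: c_j = sum_i w[j][i] * x_i."""
    m = len(local_data[0])
    return [
        [sum(row[i] * local_data[i][t] for i in range(len(local_data)))
         for t in range(m)]
        for row in weights
    ]

def recover_two(local_data, checksums, weights, failed):
    """Rebuild the checkpoints of two failed processes.

    local_data holds None at the failed indices; only survivors are read.
    """
    a, b = failed
    m = len(checksums[0])
    survivors = [i for i in range(len(local_data)) if i not in failed]
    # Remove the surviving processes' contributions from each checksum.
    r = [
        [checksums[j][t] - sum(weights[j][i] * local_data[i][t] for i in survivors)
         for t in range(m)]
        for j in range(2)
    ]
    # Solve the remaining 2x2 linear system element by element (Cramer's rule).
    w00, w01 = weights[0][a], weights[0][b]
    w10, w11 = weights[1][a], weights[1][b]
    det = w00 * w11 - w01 * w10
    xa = [(r[0][t] * w11 - w01 * r[1][t]) / det for t in range(m)]
    xb = [(w00 * r[1][t] - w10 * r[0][t]) / det for t in range(m)]
    return xa, xb

# Four simulated processes, each holding a local checkpoint of size m = 2.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]]  # assumed weights
checksums = encode(data, weights)

# Simulate losing processes 1 and 3, then rebuild them from the survivors.
damaged = [data[0], None, data[2], None]
rebuilt_1, rebuilt_3 = recover_two(damaged, checksums, weights, (1, 3))
```

Because the checksums are computed over floating-point numbers, recovery in general carries rounding error that depends on how well conditioned the weight submatrix is; the small integer-valued inputs above are chosen so that the arithmetic happens to be exact.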

Keywords :

Reed-Solomon codes; application program interfaces; checkpointing; fault tolerant computing; message passing; parallel machines; FT-MPI; Reed-Solomon erasure codes; diskless checkpointing; floating-point numbers; high performance scientific computing; highly scalable self-healing algorithm; mean-time-to-failure; preconditioned conjugate gradient equation solver; Application software; Bandwidth; Checkpointing; Computer applications; Computer architecture; Delay; Encoding; Reed-Solomon codes; Robustness; Scientific computing; Message Passing Interface; Self-healing; diskless checkpointing; fault tolerance; high-performance computing; parallel and distributed systems; pipeline

fLanguage :

English

Journal_Title :

Computers, IEEE Transactions on

Publisher :

IEEE

ISSN :

0018-9340

Type :

jour

DOI :

10.1109/TC.2009.42

Filename :

4799775

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1191256