Regional Information Center for Science and Technology (RICeST) - Building algorithmically nonstop fault tolerant MPI programs

DocumentCode :

3330725

Title :

Building algorithmically nonstop fault tolerant MPI programs

Author :

Wang, Rui ; Yao, Erlin ; Chen, Mingyu ; Tan, Guangming ; Balaji, Pavan ; Buntinas, Darius

Author_Institution :

State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China

fYear :

2011

fDate :

18-21 Dec. 2011

Firstpage :

Lastpage :

Abstract :

With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are a norm rather than an exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, where even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of an application, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at a very low cost. In order to implement the scheme, some new fault-tolerant features of the Message Passing Interface (MPI) have been investigated and utilized in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at a small scale.
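
Note on the recovery idea: the abstract does not give the encoding used, so the following is only a minimal sketch of the generic checksum principle behind ABFT, not the authors' scheme or their MPICH/HPL implementation. It assumes a plain element-wise sum checksum held by one redundant process and a single fail-stop failure; the names `encode`, `recover`, and `blocks` are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: sum-checksum recovery of one lost data block.
# Assumptions (not from the record): one redundant block holding the
# element-wise sum of all data blocks, and exactly one failed process.

def encode(blocks):
    """Return the checksum block: the element-wise sum of all data blocks."""
    return [sum(column) for column in zip(*blocks)]

def recover(blocks, checksum, failed):
    """Rebuild the block at index `failed` from the checksum and the survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    partial = [sum(column) for column in zip(*survivors)]
    # lost block = checksum - (sum of surviving blocks)
    return [c - p for c, p in zip(checksum, partial)]

# One block per simulated process; integers keep the arithmetic exact.
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = encode(blocks)      # held by the hypothetical redundant process
blocks[1] = None               # simulate a fail-stop failure of process 1
restored = recover(blocks, checksum, failed=1)
print(restored)
```

In a stop-and-wait setting this subtraction would run as soon as the failure is detected. The abstract describes deferring recovery to the end of execution; how the encoding is maintained in the meantime is not stated in this record.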

Keywords :

application program interfaces; fault tolerant computing; message passing; parallel processing; redundancy; ABFT recovery; HPC application; HPC system; MPICH implementation; algorithmically nonstop fault tolerant MPI program; corrupted node; fail-stop failure; fault-tolerant feature; high performance Linpack; high performance computing system; message passing interface; nonstop fault tolerance scheme; redundant node; stop-and-wait algorithm-based fault tolerance recovery technique; Checkpointing; Encoding; Fault tolerant systems; Libraries; Matrix decomposition; Redundancy; Algorithm-Based Fault Tolerance; High Performance Linpack; MPICH; Nonstop;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2011 18th International Conference on

Conference_Location :

Bangalore

Print_ISBN :

978-1-4577-1951-6

Electronic_ISBN :

978-1-4577-1949-3

Type :

conf

DOI :

10.1109/HiPC.2011.6152716

Filename :

6152716

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3330725