Event-Driven Fault Tolerance for Building Nonstop Active Message Programs

Author

Chao Li ; Changhai Zhao ; Haihua Yan ; Jianlei Zhang

Author_Institution

Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China

fYear

2013

fDate

13-15 Nov. 2013

Firstpage

382

Lastpage

390

Abstract

With the decreasing Mean Time Between Failures (MTBF) of high performance computing systems, process failure has become a normal phenomenon rather than an exception. The high frequency of failures makes fault tolerance a key feature of high performance applications. To provide fault tolerance interfaces for active message programs, this paper proposes a novel model called event-driven fault tolerance. The model converts each detected process failure into an event containing detailed failure information about the execution context, and then delivers the event up to the application layer by executing user-directed event handlers that drive the program to recover from the fault. Based on events, the model can provide applications with dynamic process groups and fault tolerant communication interfaces. We present an implementation of the model called EDFT (Event Driven Fault Tolerance) and describe its architecture, principles, components and application programming interfaces (API). To evaluate the model, we incorporate EDFT into a scientific application, PreStack Depth Migration (PSDM). Experiments are conducted by injecting various kinds of faults into PSDM while it is running. Experimental results show that, for active message programs that demand high performance, the event-driven fault tolerance model promises strong robustness, low overhead and high scalability.
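
The mechanism described in the abstract (a detected process failure becomes an event, and the event is delivered to user-registered handlers that perform recovery) can be illustrated with a minimal single-process sketch. This is not EDFT's API, which this record does not show; the names `FailureEvent`, `Runtime`, `on_failure`, `report_failure` and `dispatch` are all hypothetical, and failure detection is simulated by an explicit call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical record describing one detected process failure."""
    failed_rank: int  # which process failed
    context: str      # what the runtime was doing when it noticed


class Runtime:
    """Toy stand-in for a fault-tolerant runtime (assumption, not EDFT)."""

    def __init__(self, ranks):
        self.group: Set[int] = set(ranks)  # dynamic process group
        self.handlers: List[Callable[["Runtime", FailureEvent], None]] = []
        self.pending: List[FailureEvent] = []

    def on_failure(self, handler):
        """Register a user-directed event handler; usable as a decorator."""
        self.handlers.append(handler)
        return handler

    def report_failure(self, rank: int, context: str) -> None:
        """Convert a detected failure into a queued event (once per rank)."""
        if rank in self.group:
            self.group.discard(rank)  # shrink the group first
            self.pending.append(FailureEvent(rank, context))

    def dispatch(self) -> int:
        """Deliver queued events to every handler; return events delivered."""
        delivered = 0
        while self.pending:
            event = self.pending.pop(0)
            for handler in self.handlers:
                handler(self, event)
            delivered += 1
        return delivered


def demo():
    rt = Runtime(range(4))
    work: Dict[int, List[str]] = {0: ["t0"], 1: ["t1"], 2: ["t2"], 3: ["t3"]}

    @rt.on_failure
    def reassign(runtime: Runtime, event: FailureEvent) -> None:
        # Recovery policy chosen for illustration: move the failed
        # rank's tasks to the lowest-numbered surviving rank.
        orphaned = work.pop(event.failed_rank, [])
        work[min(runtime.group)].extend(orphaned)

    rt.report_failure(2, "send")  # simulated detection
    rt.dispatch()
    return rt.group, work
```

Calling `demo()` removes rank 2 from the group and moves its task to rank 0. Keeping detection (`report_failure`) separate from delivery (`dispatch`) lets the application decide when handlers run, which is one plausible reading of delivering events up to the application layer.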

Keywords

application program interfaces; fault tolerant computing; parallel programming; system recovery; API; EDFT; MTBF; PSDM; application layer; application programming interfaces; dynamic process groups; event-driven fault tolerance; execution context; fault tolerance interfaces; fault tolerant communication interfaces; high-performance computing systems; mean time between failures; nonstop active message programs; prestack depth migration; process failure information; scientific application; user-directed event handler; Buildings; Computational modeling; Computer architecture; Fault tolerance; Fault tolerant systems; Runtime; Silicon; active messages; event-driven; fault tolerance; nonstop programs; reliability;

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on

Conference_Location

Zhangjiajie

Type

conf

DOI

10.1109/HPCC.and.EUC.2013.62

Filename

6831944