DocumentCode
688171
Title
Event-Driven Fault Tolerance for Building Nonstop Active Message Programs
Author
Chao Li ; Changhai Zhao ; Haihua Yan ; Jianlei Zhang
Author_Institution
Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
fYear
2013
fDate
13-15 Nov. 2013
Firstpage
382
Lastpage
390
Abstract
As the Mean Time Between Failures (MTBF) of high-performance computing systems decreases, process failure has become a normal phenomenon rather than an exception. Such frequent failures make fault tolerance a key feature of high-performance applications. To provide fault tolerance interfaces for active message programs, this paper proposes a novel model called event-driven fault tolerance. The model converts each detected process failure into an event that carries detailed failure information about the execution context, and then delivers the event to the application layer by executing user-directed event handlers that drive the program to recover from the fault. Based on events, the model can provide applications with dynamic process groups and fault-tolerant communication interfaces. We present an implementation of the model called EDFT (Event-Driven Fault Tolerance) and describe its architecture, principles, components, and application programming interfaces (APIs). To evaluate the model, we incorporate EDFT into a scientific application, PreStack Depth Migration (PSDM), and conduct experiments by injecting various kinds of faults into PSDM while it is running. The experimental results show that, for active message programs that demand high performance, the event-driven fault tolerance model promises strong robustness, low overhead, and high scalability.
Keywords
application program interfaces; fault tolerant computing; parallel programming; system recovery; API; EDFT; MTBF; PSDM; application layer; application programming interfaces; dynamic process groups; event-driven fault tolerance; execution context; fault tolerance interfaces; fault tolerant communication interfaces; high-performance computing systems; mean time between failures; nonstop active message programs; prestack depth migration; process failure information; scientific application; user-directed event handler; Buildings; Computational modeling; Computer architecture; Fault tolerance; Fault tolerant systems; Runtime; Silicon; active messages; event-driven; fault tolerance; nonstop programs; reliability;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
Conference_Location
Zhangjiajie
Type
conf
DOI
10.1109/HPCC.and.EUC.2013.62
Filename
6831944