• DocumentCode
    688171
  • Title
    Event-Driven Fault Tolerance for Building Nonstop Active Message Programs
  • Author
    Chao Li ; Changhai Zhao ; Haihua Yan ; Jianlei Zhang
  • Author_Institution
    Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
  • fYear
    2013
  • fDate
    13-15 Nov. 2013
  • Firstpage
    382
  • Lastpage
    390
  • Abstract
    With the decreasing Mean Time Between Failures (MTBF) of high performance computing systems, process failure has become a normal phenomenon rather than an exception. Such frequent failures make fault tolerance a key feature of high performance applications. To provide fault tolerance interfaces for active message programs, this paper proposes a novel model called event-driven fault tolerance. The model converts each detected process failure into an event containing detailed failure information about the execution context, and then delivers the event up to the application layer, where user-directed event handlers drive the program to recover from the fault. Based on these events, the model can provide applications with dynamic process groups and fault tolerant communication interfaces. We present an implementation of the model called EDFT (Event Driven Fault Tolerance) and describe its architecture, principles, components and application programming interfaces (API). To evaluate the model, we incorporate EDFT into a scientific application, PreStack Depth Migration (PSDM), and conduct experiments by injecting various kinds of faults into PSDM while it is running. Experimental results show that, for active message programs that demand high performance, the event-driven fault tolerance model promises strong robustness, low overhead and high scalability.
  • Keywords
    application program interfaces; fault tolerant computing; parallel programming; system recovery; API; EDFT; MTBF; PSDM; application layer; application programming interfaces; dynamic process groups; event-driven fault tolerance; execution context; fault tolerance interfaces; fault tolerant communication interfaces; high-performance computing systems; mean time between failures; nonstop active message programs; prestack depth migration; process failure information; scientific application; user-directed event handler; Buildings; Computational modeling; Computer architecture; Fault tolerance; Fault tolerant systems; Runtime; Silicon; active messages; event-driven; fault tolerance; nonstop programs; reliability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
  • Conference_Location
    Zhangjiajie
  • Type
    conf
  • DOI
    10.1109/HPCC.and.EUC.2013.62
  • Filename
    6831944