DocumentCode
167511
Title
Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver
Author
Ali, Md Mortuza ; Southern, James ; Strazdins, Peter ; Harding, Brendan
Author_Institution
Res. Sch. of Comput. Sci., Australian Nat. Univ., Canberra, ACT, Australia
fYear
2014
fDate
19-23 May 2014
Firstpage
1169
Lastpage
1178
Abstract
A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research work on user level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application that solves 2D partial differential equations (PDEs) by means of a sparse grid combination technique and is capable of surviving multiple process failures. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size, by re-spawning failed MPI processes on the same physical processors where they were before the failure (for load balancing). It also involves restoring lost data from either exact checkpointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency, for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique is more accurate than the near-exact replication method. The implementation details contributed by this paper, together with the analysis of the experimental results, will help application developers to resolve different issues in the design and implementation of fault-tolerant applications by means of the Open MPI ULFM standard.
Keywords
application program interfaces; fault tolerant computing; message passing; open systems; partial differential equations; resource allocation; system recovery; 2D partial differential equations; MPI forum; MPI processes; Open MPI ULFM standard; PDE solver; ULFM proposal; alternate sparse grid combination technique; application level fault recovery; approximated data recovery; checkpointing; data recovery overhead; draft ULFM; fault tolerance working group; fault-tolerant applications; fault-tolerant implementation; fault-tolerant open MPI; fault-tolerant version; faulty communicator reconstruction time; load balancing; near-exact copy; near-exact replication method; open message passing interface; replicated data; user level failure mitigation; user level failure recovery; very low disk write latency; Approximation methods; Educational institutions; Fault tolerance; Fault tolerant systems; Libraries; Standards; Synchronization; PDE solver; ULFM; approximation error; fault tolerance; process failure recovery; sparse grid combination;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International
Conference_Location
Phoenix, AZ
Print_ISBN
978-1-4799-4117-9
Type
conf
DOI
10.1109/IPDPSW.2014.132
Filename
6969514