DocumentCode
2980643
Title
Supporting User-directed Fault Tolerance over Standard MPI
Author
Zhimin Wu ; Rui Wang ; Weizhi Xu ; Mingyu Chen ; Erlin Yao
Author_Institution
State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
fYear
2012
fDate
17-19 Dec. 2012
Firstpage
696
Lastpage
697
Abstract
User-directed means that fault tolerance is carried out dynamically and that the fault tolerance mode is chosen by users according to application requirements. In this paper, we introduce a general scheme, built on standard MPI, that provides user-directed support for application-level algorithmic fault tolerance. User-directed fault tolerance serves as the connection between applications and algorithmic fault tolerance. As a case study, we incorporated our scheme into HPL, combined with a non-blocking ABFT technique. We tested the functional availability of our fault tolerance scheme under real conditions. Our evaluation shows that when no failure occurs, our support introduces only 2.5 percent overhead; when a failure does occur, the scalability of algorithmic fault tolerance is well maintained with our scheme.
Keywords
application program interfaces; fault tolerant computing; message passing; HPL; application level algorithmic fault tolerance; functional availability; nonblocking ABFT technique; standard MPI; user-directed fault tolerance mode; Algorithm design and analysis; Conferences; Detectors; Fault tolerance; Fault tolerant systems; Scalability; Standards; HPL; algorithmic fault tolerance; application-level; standard MPI; user-directed fault tolerance;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
Conference_Location
Singapore
ISSN
1521-9097
Print_ISBN
978-1-4673-4565-1
Electronic_ISBN
1521-9097
Type
conf
DOI
10.1109/ICPADS.2012.100
Filename
6413632