• DocumentCode
    2980643
  • Title

    Supporting User-directed Fault Tolerance over Standard MPI

  • Author

    Zhimin Wu ; Rui Wang ; Weizhi Xu ; Mingyu Chen ; Erlin Yao

  • Author_Institution
    State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
  • fYear
    2012
  • fDate
    17-19 Dec. 2012
  • Firstpage
    696
  • Lastpage
    697
  • Abstract
    "User-directed" means that fault tolerance is carried out dynamically and that users choose the fault tolerance mode based on application requirements. In this paper, we introduce a general scheme, built on standard MPI, that provides user-directed support for application-level algorithmic fault tolerance. User-directed fault tolerance serves as the link between applications and algorithmic fault tolerance. As a case study, we incorporated our scheme into HPL together with a non-blocking ABFT technique. We tested that the scheme functions correctly for fault tolerance under real conditions. Our evaluation also shows that the support adds only 2.5 percent overhead when no failure occurs and that, when a failure does occur, the scalability of algorithmic fault tolerance is well maintained.
  • Keywords
    application program interfaces; fault tolerant computing; message passing; HPL; application level algorithmic fault tolerance; functional availability; nonblocking ABFT technique; standard MPI; user-directed fault tolerance mode; Algorithm design and analysis; Conferences; Detectors; Fault tolerance; Fault tolerant systems; Scalability; Standards; HPL; algorithmic fault tolerance; application-level; standard MPI; user-directed fault tolerance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
  • Conference_Location
    Singapore
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4673-4565-1
  • Electronic_ISBN
    1521-9097
  • Type

    conf

  • DOI
    10.1109/ICPADS.2012.100
  • Filename
    6413632