• DocumentCode
    1783388
  • Title

    FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery

  • Author

    Sato, Kento ; Moody, Adam ; Mohror, Kathryn ; Gamblin, Todd ; de Supinski, Bronis R. ; Maruyama, Naoya ; Matsuoka, Satoshi

  • Author_Institution
    Dept. of Math. & Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2014
  • fDate
    19-23 May 2014
  • Firstpage
    1225
  • Lastpage
    1234
  • Abstract
    Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs. To address this problem, we present the Fault Tolerant Messaging Interface (FMI), which enables extremely low-latency recovery. FMI accomplishes this using a survivable communication runtime coupled with fast, in-memory C/R and dynamic node allocation. FMI provides message-passing semantics similar to MPI, but applications written using FMI can run through failures. The FMI runtime software handles fault tolerance, including checkpointing application state, restarting failed processes, and allocating additional nodes when needed. Our tests show that FMI's failure-free performance is similar to that of MPI, and that FMI incurs only a 28% overhead even with a very short mean time between failures of 1 minute.
  • Keywords
    checkpointing; failure analysis; parallel processing; software fault tolerance; FMI runtime software; MPI; checkpointing application state; checkpoint-restart; dynamic node allocation; extreme-scale systems; failure mitigation; failure rates; failure-free performance; fast in-memory C/R; fast recovery; fault tolerant messaging interface; higher-fidelity simulations; low-latency recovery; parallel file system; supercomputers; survivable communication runtime; transparent recovery; Fault tolerance; Fault tolerant systems; Hardware; Overlay networks; Peer-to-peer computing; Resource management; Runtime; Checkpoint/Restart
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
  • Conference_Location
    Phoenix, AZ
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4799-3799-8
  • Type

    conf

  • DOI
    10.1109/IPDPS.2014.126
  • Filename
    6877350