• DocumentCode
    3678425
  • Title

    Building a Fault Tolerant Application Using the GASPI Communication Layer

  • Author

    Faisal Shahzad;Moritz Kreutzer;Thomas Zeiser;Rui Machado;Andreas Pieper;Georg Hager;Gerhard Wellein

  • Author_Institution
    Erlangen Regional Comput. Center, Univ. of Erlangen-Nuremberg, Erlangen, Germany
  • fYear
    2015
  • Firstpage
    580
  • Lastpage
    587
  • Abstract
    It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failure (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures; the User-Level Failure Mitigation (ULFM) proposal, currently the most promising approach, is still in its prototype phase. In our work we use GASPI, which is a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend applications to integrate a low-cost fault detection mechanism and, if necessary, recover the application on the fly. The aspects of process management, the restoration of groups, and the recovery mechanism are presented in detail. We use an application based on sparse matrix-vector multiplication to analyze the overhead introduced by such modifications. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of an acceptable order and shows good scalability.
  • Keywords
    Conferences
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2015 IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/CLUSTER.2015.106
  • Filename
    7307655