• DocumentCode
    2386743
  • Title

    A Skeletal-Based Approach for the Development of Fault-Tolerant SPMD Applications

  • Author

    Makassikis, Constantinos ; Galtier, Virginie ; Vialle, Stephane

  • Author_Institution
    AlGorille INRIA Project Team, SUPELEC, France
  • fYear
    2010
  • fDate
    8-11 Dec. 2010
  • Firstpage
    239
  • Lastpage
    248
  • Abstract
    Distributing applications over PC clusters to speed-up or size-up the execution is now commonplace. Yet efficiently tolerating faults of these systems is a major issue. To ease the addition of checkpoint-based fault tolerance at the application level, we introduce a Model for Low-Overhead Tolerance of Faults (MoLOToF) which is based on structuring applications using fault-tolerant skeletons. MoLOToF also encourages collaborations with the programmer and the execution environment. The skeletons are adapted to specific parallelization paradigms and yield what can be called fault-tolerant algorithmic skeletons. The application of MoLOToF to the SPMD parallelization paradigm results in our proposed FT-SPMD framework. Experiments show that the complexity for developing an application is small and the use of the framework has a small impact on performance. Comparisons with existing system-level checkpoint solutions, namely LAM/MPI and DMTCP, point out that FT-SPMD has a lower runtime overhead while being more robust when a higher level of fault tolerance is required.
  • Keywords
    software fault tolerance; MoLOToF; fault tolerant SPMD applications; model for low overhead tolerance of faults; skeletal based approach; Checkpointing; Collaboration; Fault tolerance; Fault tolerant systems; Programming; Routing; Skeleton; SPMD; application-level checkpointing; fault tolerance; framework; programming skeletons;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010 International Conference on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-9110-0
  • Electronic_ISBN
    978-0-7695-4287-4
  • Type

    conf

  • DOI
    10.1109/PDCAT.2010.89
  • Filename
    5704425