• DocumentCode
    3558951
  • Title

    FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

  • Author

    Yang, Xuejun ; Du, Yunfei ; Wang, Panfeng ; Fu, Hongyi ; Jia, Jia

  • Author_Institution
    Nat. Univ. of Defense Technol., Changsha, China
  • Volume
    20
  • Issue
    10
  • fYear
    2009
  • Firstpage
    1471
  • Lastpage
    1486
  • Abstract
    As large-scale computer systems grow, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the fault-tolerant parallel algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates it. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA performs better than the traditional checkpointing approach.
  • Keywords
    fault tolerant computing; parallel processing; program compilers; FTPA; application-level fault-tolerant approach; fault-tolerant parallel algorithm; fault-tolerant parallel computing; get it fault-tolerant source-to-source precompiler tool; mean-time-between-failures; parallel recomputing; rollback-recovery protocols; Concurrent Programming; Fault tolerance; Operating Systems; Reliability; Software/Software Engineering; fast self-recovery
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • Conference_Location
    10/17/2008 12:00:00 AM
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2008.231
  • Filename
    4653486