• DocumentCode
    549783
  • Title

    A new parallel recomputing code design methodology for fault-tolerant parallel algorithm

  • Author

    Du, Yunfei ; Peng, Lin ; Zhao, Kejia

  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2011
  • fDate
    27-30 June 2011
  • Firstpage
    220
  • Lastpage
    226
  • Abstract
    As large-scale computer systems grow, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Fault-tolerant Parallel Algorithm (FTPA) is an application-level fault-tolerance approach for large-scale scientific applications that achieves fast self-recovery through parallel recomputing. In this paper, we first propose a new design methodology for parallel recomputing code; code designed with this methodology achieves high parallel recomputing efficiency. Second, we automate the methodology using compiler technology. Finally, we evaluate our approach with two kernels from the NAS Parallel Benchmarks on a cluster system with 512 CPUs. The experimental results show that the parallel recomputing code generated by our approach recomputes in parallel more efficiently than code generated by loop parallelization.
  • Keywords
    parallel algorithms; program compilers; software fault tolerance; software performance evaluation; CPU; FTPA; NAS parallel benchmarks; application-level fault-tolerant approach; cluster system; compiler technology; fault-tolerant parallel algorithm; hardware failures; large-scale computer systems; large-scale scientific applications; loop parallelization; mean-time-between-failures; parallel recomputing code design methodology; parallel recomputing code generation; performance evaluation; self-recovery; Arrays; Computers; Design methodology; Fault tolerance; Fault tolerant systems; Parallel processing; Program processors; parallel recomputing code; slice; template;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2011 International Symposium on Performance Evaluation of Computer & Telecommunication Systems (SPECTS)
  • Conference_Location
    The Hague
  • Print_ISBN
    978-1-4577-0139-9
  • Electronic_ISBN
    978-1-61782-309-1
  • Type

    conf

  • Filename
    5984869