DocumentCode :
549783
Title :
A new parallel recomputing code design methodology for fault-tolerant parallel algorithm
Author :
Du, Yunfei ; Peng, Lin ; Zhao, Kejia
Author_Institution :
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2011
fDate :
27-30 June 2011
Firstpage :
220
Lastpage :
226
Abstract :
As large-scale computer systems grow, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Fault-tolerant Parallel Algorithm (FTPA) is an application-level fault-tolerance approach for large-scale scientific applications that achieves fast self-recovery through parallel recomputing. In this paper, we first propose a new parallel recomputing code design methodology; code designed with this methodology achieves high parallel recomputing efficiency. Second, we automate the methodology using compiler technology. Finally, we evaluate the performance of our approach with two kernels of the NAS Parallel Benchmarks on a cluster system with 512 CPUs. The experimental results show that the parallel recomputing code generated by our approach recomputes more efficiently than the code generated by loop parallelization.
Keywords :
parallel algorithms; program compilers; software fault tolerance; software performance evaluation; CPU; FTPA; NAS parallel benchmarks; application-level fault-tolerant approach; cluster system; compiler technology; fault-tolerant parallel algorithm; hardware failures; large-scale computer systems; large-scale scientific applications; loop parallelization; mean-time-between-failures; parallel recomputing code design methodology; parallel recomputing code generation; performance evaluation; self-recovery; Arrays; Computers; Design methodology; Fault tolerance; Fault tolerant systems; Parallel processing; Program processors; parallel recomputing code; slice; template;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Performance Evaluation of Computer & Telecommunication Systems (SPECTS), 2011 International Symposium on
Conference_Location :
The Hague
Print_ISBN :
978-1-4577-0139-9
Electronic_ISBN :
978-1-61782-309-1
Type :
conf
Filename :
5984869