DocumentCode
549783
Title
A new parallel recomputing code design methodology for fault-tolerant parallel algorithm
Author
Du, Yunfei ; Peng, Lin ; Zhao, Kejia
Author_Institution
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear
2011
fDate
27-30 June 2011
Firstpage
220
Lastpage
226
Abstract
As large-scale computer systems grow, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Fault-tolerant Parallel Algorithm (FTPA) is an application-level fault-tolerance approach for large-scale scientific applications that achieves fast self-recovery through parallel recomputing. In this paper, we first propose a new design methodology for parallel recomputing code; code designed with this methodology recomputes in parallel with high efficiency. Second, we automate the methodology using compiler technology. Finally, we evaluate our approach with two kernels of the NAS Parallel Benchmarks on a cluster system with 512 CPUs. The experimental results show that the parallel recomputing code generated by our approach achieves higher parallel recomputing efficiency than code generated by loop parallelization.
Keywords
parallel algorithms; program compilers; software fault tolerance; software performance evaluation; CPU; FTPA; NAS parallel benchmarks; application-level fault-tolerant approach; cluster system; compiler technology; fault-tolerant parallel algorithm; hardware failures; large-scale computer systems; large-scale scientific applications; loop parallelization; mean-time-between-failures; parallel recomputing code design methodology; parallel recomputing code generation; performance evaluation; self-recovery; Arrays; Computers; Design methodology; Fault tolerance; Fault tolerant systems; Parallel processing; Program processors; parallel recomputing code; slice; template;
fLanguage
English
Publisher
IEEE
Conference_Titel
Performance Evaluation of Computer & Telecommunication Systems (SPECTS), 2011 International Symposium on
Conference_Location
The Hague
Print_ISBN
978-1-4577-0139-9
Electronic_ISBN
978-1-61782-309-1
Type
conf
Filename
5984869