DocumentCode
3459229
Title
An Application-Level Synchronous Checkpoint-Recover Method for Parallel CFD Simulation
Author
Ren Xiaoguang ; Xu Xinhai ; Tang Yuhua ; Fang Xudong ; Sen Ye
Author_Institution
State Key Lab. of High Performance Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear
2013
fDate
3-5 Dec. 2013
Firstpage
58
Lastpage
65
Abstract
High Performance Computing (HPC) is increasingly used to accelerate Computational Fluid Dynamics (CFD) simulation. However, CFD simulation faces serious reliability problems, and fault-tolerance techniques must be adopted to ensure the efficient execution of large-scale parallel CFD simulations. In this paper, we propose an application-level synchronous checkpoint-recover method for parallel CFD simulation based on the application features of CFD simulation. In this method, the periodic snapshot output of the CFD simulation is naturally treated as a blocking coordinated checkpoint, and all processes can resume execution from the latest checkpoint after an arbitrary number of processes fail. We design a synchronous checkpoint-recovery framework for CFD simulation and implement it in the open source software OpenFOAM. Experimental results demonstrate that our method supports fault tolerance in large-scale parallel CFD applications with very little overhead beyond the original cost of the periodic CFD snapshot output.
Keywords
checkpointing; computational fluid dynamics; fault tolerant computing; parallel processing; public domain software; CFD periodic snapshot output; HPC; Open FOAM; application-level synchronous checkpoint-recover method; blocking coordinated checkpoint; computational fluid dynamics; fail process; fault tolerant technology; high performance computing; large-scale parallel CFD simulation execution; open source software; process execution; reliability problem; Computational fluid dynamics; Computational modeling; Fault tolerance; Fault tolerant systems; Libraries; Maintenance engineering; Mathematical model; CFD; Checkpoint-Recover; Fault tolerant; HPC;
fLanguage
English
Publisher
ieee
Conference_Titel
Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
Conference_Location
Sydney, NSW
Type
conf
DOI
10.1109/CSE.2013.19
Filename
6755197