Title :
An Application-Level Synchronous Checkpoint-Recover Method for Parallel CFD Simulation
Author :
Ren Xiaoguang ; Xu Xinhai ; Tang Yuhua ; Fang Xudong ; Sen Ye
Author_Institution :
State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
Abstract :
High Performance Computing (HPC) is increasingly used to accelerate Computational Fluid Dynamics (CFD) simulation. However, large-scale parallel CFD simulation faces serious reliability problems, and fault-tolerance techniques must be adopted to ensure its efficient execution. In this paper, we propose an application-level synchronous checkpoint-recover method for parallel CFD simulation that exploits the application features of CFD simulation. In this method, the periodic snapshot output of the CFD simulation is naturally treated as a blocking coordinated checkpoint, and all processes can resume execution from the latest checkpoint regardless of how many processes have failed. We design a synchronous checkpoint-recovery framework for CFD simulation and implement it in the open source software OpenFOAM. Experimental results demonstrate that our method supports fault tolerance in large-scale parallel CFD applications well, adding very little overhead to the original cost of CFD periodic snapshot output.
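The idea of reusing the periodic snapshot as a checkpoint can be illustrated with a minimal, single-process sketch. It is not the authors' OpenFOAM implementation: the JSON snapshot format, the toy `advance` update, and the names `latest_checkpoint`, `write_snapshot`, and `run` are all assumptions made for illustration, and the synchronization among parallel processes that makes the checkpoint "blocking coordinated" is omitted.

```python
import json
import os
import tempfile


def latest_checkpoint(case_dir):
    """Return (step, state) of the newest complete snapshot, or the initial state."""
    steps = [int(name[:-5]) for name in os.listdir(case_dir)
             if name.endswith(".json") and name[:-5].isdigit()]
    if not steps:
        return 0, {"u": 1.0}  # hypothetical initial field value
    step = max(steps)
    with open(os.path.join(case_dir, f"{step}.json")) as fh:
        return step, json.load(fh)


def write_snapshot(case_dir, step, state):
    """Write the snapshot to a temporary file, then rename it into place,
    so a crash during the write never leaves a half-written checkpoint."""
    tmp = os.path.join(case_dir, f"{step}.json.tmp")
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, os.path.join(case_dir, f"{step}.json"))


def advance(state):
    """Toy stand-in for one solver time step (assumption, not a CFD scheme)."""
    return {"u": state["u"] * 0.5 + 1.0}


def run(case_dir, end_step, write_interval, fail_at=None):
    """Time-step loop that resumes from the latest snapshot if one exists."""
    step, state = latest_checkpoint(case_dir)
    while step < end_step:
        if fail_at is not None and step + 1 == fail_at:
            raise RuntimeError(f"simulated process failure before step {fail_at}")
        state = advance(state)
        step += 1
        if step % write_interval == 0:
            # The periodic output doubles as the checkpoint.
            write_snapshot(case_dir, step, state)
    return step, state


if __name__ == "__main__":
    case = tempfile.mkdtemp()
    try:
        run(case, end_step=10, write_interval=3, fail_at=8)
    except RuntimeError as err:
        print(err)
    print("resuming from step", latest_checkpoint(case)[0])
    print(run(case, end_step=10, write_interval=3))
```

In this sketch, restarting the same `run` call after a failure repeats only the steps taken since the last snapshot, which is why the extra cost on top of the normal snapshot output can stay small.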
Keywords :
checkpointing; computational fluid dynamics; fault tolerant computing; parallel processing; public domain software; CFD periodic snapshot output; HPC; OpenFOAM; application-level synchronous checkpoint-recover method; blocking coordinated checkpoint; fail process; fault tolerant technology; high performance computing; large-scale parallel CFD simulation execution; open source software; process execution; reliability problem; Computational modeling; Fault tolerance; Fault tolerant systems; Libraries; Maintenance engineering; Mathematical model; CFD; Checkpoint-Recover; Fault tolerant
Conference_Title :
2013 IEEE 16th International Conference on Computational Science and Engineering (CSE)
Conference_Location :
Sydney, NSW
DOI :
10.1109/CSE.2013.19