DocumentCode
2294072
Title
Adaptive and Fault Tolerant Simulation of Relativistic Particle Transport with Data-Level Checkpointing
Author
Li, Ruipeng ; Jiang, Hai ; Su, Hung-Chi ; Zhang, Bin ; Jenness, Jeff
Author_Institution
Dept. of Comput. Sci., Arkansas State Univ., Jonesboro, AR
fYear
2008
fDate
16-18 July 2008
Firstpage
345
Lastpage
352
Abstract
Many scientific applications exhibit high demands on memory storage and computing capability. Improvements in commodity processors and networks have provided an opportunity to support such scientific applications within an everyday computing infrastructure. Good applications need the ability to work in constantly changing environments. Adaptability and fault tolerance are essential. Based on simulation of relativistic particle transport, this paper proposes a data-level checkpointing scheme for common scientific applications. This scheme takes advantage of the regular program layout, dominant computing loops, and fine-grained iterations. Without handling stack and heap segments directly, only application data is saved and restored as the computation state. The checkpointing interval can be dynamically adjusted to satisfy sensitivity and efficiency requirements for feasible fault tolerance. With this periodic but fixed-location checkpointing scheme, the MPI-based simulation system can be reconfigured by being shut down first and then restarted on the same or different computer clusters. Application data can be redistributed for the new configuration. Experimental results have demonstrated this scheme's efficiency and effectiveness.
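The scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (save_checkpoint, load_checkpoint, advance, run, redistribute), the file format (pickle), and the toy particle update are all assumptions made for illustration, and plain Python lists stand in for the MPI processes of the original system. The sketch shows only the three ideas the abstract names: saving application data alone, checkpointing at one fixed location in the dominant loop with an adjustable interval, and redistributing restored data for a new configuration.

```python
import os
import pickle


def save_checkpoint(path, step, particles):
    """Write only application data (step counter and particle list)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "particles": particles}, f)
    os.replace(tmp, path)  # rename last, so a crash never leaves a partial file


def load_checkpoint(path):
    """Return (step, particles), or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["particles"]


def advance(particles, dt=0.1):
    """Toy update standing in for one transport iteration: x += v * dt."""
    return [(x + v * dt, v) for (x, v) in particles]


def run(path, initial, total_steps, interval, crash_at=None):
    """Run the dominant loop, checkpointing at one fixed location (loop top)."""
    restored = load_checkpoint(path)
    step, particles = restored if restored else (0, list(initial))
    while step < total_steps:
        if step % interval == 0:  # interval is the tunable checkpoint period
            save_checkpoint(path, step, particles)
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated failure")  # hypothetical fault
        particles = advance(particles)
        step += 1
    return particles


def redistribute(particles, n_workers):
    """Split restored data into contiguous blocks for a new worker count."""
    base, extra = divmod(len(particles), n_workers)
    blocks, start = [], 0
    for rank in range(n_workers):
        size = base + (1 if rank < extra else 0)
        blocks.append(particles[start:start + size])
        start += size
    return blocks
```

Because the checkpoint is taken at the top of an iteration, the saved step counter and particle list are enough to resume; no stack or heap contents are needed. A smaller interval loses less work after a failure but writes more often, which is the trade-off the abstract refers to as sensitivity versus efficiency.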
Keywords
application program interfaces; checkpointing; digital simulation; fault tolerant computing; iterative methods; message passing; natural sciences computing; MPI-based simulation system; adaptive simulation; commodity processors; computer clusters; computing capability; data-level checkpointing; dominant computing loops; fault tolerant simulation; fine-grained iterations; fixed-location checkpointing scheme; memory storage; regular program layout; relativistic particle transport; scientific applications; Application software; Checkpointing; Computational modeling; Computer crashes; Computer networks; Computer simulation; Distributed computing; Fault tolerance; Physics computing; Plasma simulation; Checkpointing; Fault Tolerance; Reconfiguration; Relativistic Particle Transport; Simulation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Science and Engineering, 2008. CSE '08. 11th IEEE International Conference on
Conference_Location
Sao Paulo
Print_ISBN
978-0-7695-3193-9
Type
conf
DOI
10.1109/CSE.2008.54
Filename
4578252