DocumentCode :
3032921
Title :
A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy
Author :
Yulu Zhang ; Xinyuan Guo ; Hai Jiang ; Kuan-Ching Li
Author_Institution :
Dept. of Comput. Sci., Arkansas State Univ., Jonesboro, AR, USA
fYear :
2013
fDate :
1-3 July 2013
Firstpage :
247
Lastpage :
252
Abstract :
Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications. However, as GPUs take on a much bigger role in high performance computing, there is not yet an effective checkpoint/restart scheme for them, owing to the GPU's batch-mode execution. The paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler and a run-time support module are developed to construct and save states in CPU system memory dynamically. Secondary storage can be utilized for scalability and long-term fault tolerance. CUDA applications with complicated memory use are supported as well. Experimental results have demonstrated the effectiveness of the proposed scheme.
Keywords :
checkpointing; fault tolerant computing; parallel architectures; program compilers; storage management; CPU system memory; CUDA applications; GPU batch-mode execution; GPU computation state restoration; GPU computation state saving; application-level checkpoint-restart scheme; complex memory hierarchy; high performance computing; long-term fault tolerance; memory use; precompiler; run-time support module; scalability; scientific application; secondary storage; Arrays; Graphics processing units; Kernel; Libraries; Radiation detectors; Registers; CUDA; GPU; checkpoint/restart;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2013 14th ACIS International Conference on
Conference_Location :
Honolulu, HI
Type :
conf
DOI :
10.1109/SNPD.2013.5
Filename :
6598473