Title : 
Two-level checkpoint/restart modeling for GPGPU
         
        
            Author : 
Laosooksathit, Supada ; Naksinehaboon, Nichamon ; Leangsuksan, Chokchai
         
        
            Author_Institution : 
Dept. of Comput. Sci., Louisiana Tech Univ., Ruston, LA, USA
         
        
        
        
        
        
            Abstract : 
Because the reliability and availability of a large-scale system decrease as the number of computing elements grows, fault tolerance has become a major concern in high performance computing (HPC), including very large systems with GPGPUs. In this paper, we propose a checkpoint/restart mechanism model that employs a two-phase protocol and a latency hiding technique, such as CUDA streams, to achieve low checkpoint overhead. We introduce GPU checkpoint and restart protocols. We also present experimental results and analyze the influence of the mechanism, especially on long-running applications.
         
        
            Keywords : 
checkpointing; fault tolerant computing; graphics processing units; CUDA streams; GPGPU; fault tolerance; high performance computing; large scaled system; latency hiding technique; restart protocols; two-level checkpoint mechanism modeling; two-level restart mechanism modeling; two-phase protocol; Arrays; Checkpointing; Fault tolerance; Fault tolerant systems; Graphics processing unit; Kernel; Protocols;
         
        
        
        
            Conference_Title : 
Computer Systems and Applications (AICCSA), 2011 9th IEEE/ACS International Conference on
         
        
            Conference_Location : 
Sharm El-Sheikh
         
        
        
            Print_ISBN : 
978-1-4577-0475-8
         
        
            Electronic_ISBN : 
2161-5322
         
        
        
            DOI : 
10.1109/AICCSA.2011.6126619