Title :
Improving the scalability of transparent checkpointing for GPU computing systems
Author :
Amrizal, A. ; Hirasawa, Shoichi ; Komatsu, Kazuhiko ; Takizawa, Hiroyuki ; Kobayashi, Hideo
Author_Institution :
Grad. Sch. of Inf. Sci., Tohoku Univ., Sendai, Japan
Abstract :
As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming because of I/O bottlenecks and network congestion. To solve this problem, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by using each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which uses a global file system, to ensure that a global checkpoint file is always available even after a catastrophic failure. We discuss the performance of the proposed mechanism through an analysis with a two-level checkpoint model.
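The division of work between local and global checkpoints can be illustrated with a minimal sketch. This is not the authors' implementation or their two-level checkpoint model: the function names, the `global_every` parameter, and the cost values (1.0 for a local write, 4.0 for a global write, chosen only to echo the reported "up to four times" figure) are all hypothetical.

```python
def checkpoint_plan(num_checkpoints, global_every):
    """Return one 'local' or 'global' label per checkpoint.

    Every `global_every`-th checkpoint is written to the global file
    system; all others go to node-local storage. Both parameters are
    hypothetical and not taken from the paper.
    """
    if global_every < 1:
        raise ValueError("global_every must be >= 1")
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_checkpoints)
    ]


def total_overhead(plan, local_cost, global_cost):
    """Sum the time spent writing checkpoints for a given plan."""
    return sum(local_cost if kind == "local" else global_cost for kind in plan)


# Illustrative costs only: a global write is assumed to take 4x a local write.
plan = checkpoint_plan(num_checkpoints=8, global_every=4)
two_level = total_overhead(plan, local_cost=1.0, global_cost=4.0)
global_only = total_overhead(["global"] * 8, local_cost=1.0, global_cost=4.0)

print(plan)
print(two_level, global_only)
```

Under these assumed costs, mixing frequent local checkpoints with occasional global ones lowers total checkpoint time while still leaving a periodic copy on the global file system for recovery from a failure that destroys node-local data.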
Keywords :
application program interfaces; checkpointing; fault tolerant computing; file organisation; graphics processing units; input-output programs; GPU computing systems; I-O bottlenecks; OpenCL applications; catastrophic failure; fault tolerance technique; global CheCL; global checkpoint file; global file system; local CheCL; network congestion; node local storage utilization; restart mechanism; scalability improvement; scalable checkpoint mechanism; transparent checkpointing process; two-level CheCL; two-level checkpoint model; Benchmark testing; Checkpointing; Computational modeling; Graphics processing units; Mathematical model; Random access memory; Scalability;
Conference_Title :
TENCON 2012 - 2012 IEEE Region 10 Conference
Conference_Location :
Cebu
Print_ISBN :
978-1-4673-4823-2
Electronic_ISBN :
2159-3442
DOI :
10.1109/TENCON.2012.6412343