DocumentCode :
1926276
Title :
DCR: A fully transparent checkpoint/restart framework for distributed systems
Author :
Ma, Can ; Huo, Zhigang ; Cai, Jingnan ; Meng, Dan
Author_Institution :
Inst. of Comput. Technol., Nat. Res. Center for Intell. Comput. Syst., Chinese Acad. of Sci., Beijing, China
fYear :
2009
fDate :
Aug. 31 2009-Sept. 4 2009
Firstpage :
1
Lastpage :
10
Abstract :
Checkpoint/restart has been widely used in computing systems for fault tolerance, job scheduling, and system maintenance. However, a lack of transparency has hindered the adoption of many implementations. In this paper, we present DCR, a fully transparent parallel checkpoint/restart framework that takes advantage of kernel-level checkpointing and TCP session preservation. DCR is fully transparent to application programmers and users: no source code modifications, recompilations, or system call interceptions are required. Because of the simplicity of its design and the dominance of TCP/IP in parallel applications, DCR can be readily deployed on systems of widely varying scale, from single-CPU computers to large-scale clusters. A new on-demand blocking checkpoint protocol, which makes use of TCP's reliability mechanism, is proposed to eliminate global synchronization. We demonstrate the effectiveness and efficiency of DCR with multiple MPICH2 applications running on Dawning 5000A.
Keywords :
checkpointing; operating system kernels; parallel processing; scheduling; software fault tolerance; software maintenance; transport protocols; workstation clusters; DCR; Dawning 5000A; MPICH2 applications; TCP session preservation; TCP/IP; distributed systems; fault tolerance; job scheduling; kernel-level checkpointing method; large-scale clusters; on-demand blocking checkpoint protocol; parallel applications; reliability mechanism; system maintenance; transparent parallel checkpoint framework; transparent restart framework; Application software; Central Processing Unit; Checkpointing; Concurrent computing; Fault tolerant systems; Large-scale systems; Processor scheduling; Programming profession; Protocols; TCPIP;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
ISSN :
1552-5244
Print_ISBN :
978-1-4244-5011-4
Electronic_ISBN :
1552-5244
Type :
conf
DOI :
10.1109/CLUSTR.2009.5289172
Filename :
5289172