DocumentCode :
3453847
Title :
Checkpointing and recovery of shared memory parallel applications in a cluster
Author :
Badrinath, R. ; Morin, Christine ; Vallee, Geoffroy
Author_Institution :
IRISA/INRIA, France
fYear :
2003
fDate :
12-15 May 2003
Firstpage :
471
Lastpage :
477
Abstract :
This paper describes issues in the design and implementation of checkpointing and recovery modules for the Kerrighed DSM cluster system. Our design targets a DSM that supports the sequential consistency model. The mechanisms are general enough to be used in a number of different checkpointing and recovery protocols. They are designed to support common performance optimizations suggested in the literature while remaining lightweight during fault-free execution. We also present preliminary performance results for the current implementation.
Keywords :
distributed shared memory systems; fault tolerant computing; protocols; system recovery; workstation clusters; Kerrighed DSM cluster system; checkpointing; cluster computing; fault tolerance; recovery protocol; shared memory parallel application; Checkpointing; Containers; Fault tolerant systems; Kernel; Linux; Memory management; Operating systems; Protocols; Random access memory; Research and development;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on
Print_ISBN :
0-7695-1919-9
Type :
conf
DOI :
10.1109/CCGRID.2003.1199403
Filename :
1199403