DocumentCode
3237350
Title
Scalable Fault-Tolerant Distributed Shared Memory
Author
Sultan, Florin ; Nguyen, Thu ; Iftode, Liviu
Author_Institution
Rutgers University
fYear
2000
fDate
04-10 Nov. 2000
Firstpage
20
Lastpage
20
Abstract
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault tolerance support is lightweight, adding only low messaging, logging, and checkpointing overheads, and that our management algorithms can be expected to effectively bound the size of the checkpoints and logs for real applications.
Keywords
Checkpointing; Clustering algorithms; Computer science; Costs; Fault tolerance; Fault tolerant systems; Home computing; Protocols; Scalability; Size control;
fLanguage
English
Publisher
IEEE
Conference_Titel
Supercomputing, ACM/IEEE 2000 Conference
ISSN
1063-9535
Print_ISBN
0-7803-9802-5
Type
conf
DOI
10.1109/SC.2000.10014
Filename
1592733