Title :
Lazy garbage collection of recovery state for fault-tolerant distributed shared memory
Author :
Sultan, Florin ; Nguyen, Thu D. ; Iftode, Liviu
Author_Institution :
Dept. of Comput. Sci., Rutgers Univ., Piscataway, NJ, USA
Date :
10/1/2002
Abstract :
We address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs: checkpoint garbage collection (CGC) and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Evaluation results for real applications show that the approach effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.
Keywords :
distributed shared memory systems; fault tolerant computing; protocols; storage management; synchronisation; system recovery; checkpoint garbage collection; checkpointing; consistency constraints; fault-tolerant distributed shared memory; lazy garbage collection; lazy log trimming; local-area clusters; log-based rollback recovery; logging; low overhead; memory model; recovery state; scalability; single-failure fault-tolerant home-based lazy release consistency system; transient disruptions; Application software; Checkpointing; Clustering algorithms; Communication system control; Distributed computing; Fault tolerance; Fault tolerant systems; Large-scale systems; Scalability; Size control
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2002.1041885