Regional Information Center for Science and Technology (RICEST) - Lazy garbage collection of recovery state for fault-tolerant distributed shared memory

DocumentCode :

788535

Title :

Lazy garbage collection of recovery state for fault-tolerant distributed shared memory

Author :

Sultan, Florin ; Nguyen, Thu D. ; Iftode, Liviu

Author_Institution :

Dept. of Comput. Sci., Rutgers Univ., Piscataway, NJ, USA

Volume :

Issue :

Year :

2002

Date :

7/1/2002

Firstpage :

673

Lastpage :

686

Abstract :

In this paper, we address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs: checkpoint garbage collection (CGC) and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. In such systems, using global synchronization or extra communication for garbage collection is inefficient or simply impractical due to system scale and temporary disconnections in communication. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Our garbage collection scheme is completely distributed, does not force processes to synchronize, does not add extra messages to the base DSM protocol, and uses only the available DSM protocol information. Evaluation results for real applications show that it effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.

Keywords :

distributed shared memory systems; fault tolerant computing; protocols; storage management; system recovery; workstation clusters; DSM protocol; checkpoint garbage collection; consistency constraints; correct recovery; extra communication; global synchronization; independent checkpointing; lazy garbage collection; lazy log trimming; local area computer clusters; logging; low overhead; recovery state; safe bounds; scalability; single-failure fault-tolerant home-based lazy release consistency distributed shared-memory system; transient disruption tolerance; Checkpointing; Clustering algorithms; Communication system control; Distributed computing; Fault tolerance; Fault tolerant systems; Large-scale systems; Protocols; Scalability; Size control;

Language :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2002.1019857

Filename :

1019857

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=788535