DocumentCode
3321656
Title
Logging and recovery in adaptive software distributed shared memory systems
Author
Kongmunvattana, Angkul ; Tzeng, Nian-Feng
Author_Institution
Center for Adv. Comput. Studies, Univ. of Southwestern Louisiana, Lafayette, LA, USA
fYear
1999
fDate
1999
Firstpage
202
Lastpage
211
Abstract
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new, efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing, since it speeds up the recovery process and eliminates the unbounded rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only the unrecoverable data (data that cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in a TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: our protocol increases the execution time only slightly, by 2% to 10%, during failure-free execution, while the ML protocol lengthens the execution time manyfold due to its larger log size and higher number of messages logged. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% for the parallel applications examined.
Keywords
distributed shared memory systems; system recovery; adaptive logging; adaptive software; adaptive software DSM; coherent global address space; message-passing machines; shared memory abstract; workstation clusters; Access protocols; Application software; Coherence; Distributed computing; Electronic switching systems; Parallel programming; Programming profession; Software performance; Sun; Workstations
fLanguage
English
Publisher
IEEE
Conference_Titel
Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE Symposium on
Conference_Location
Lausanne
ISSN
1060-9857
Print_ISBN
0-7695-0290-3
Type
conf
DOI
10.1109/RELDIS.1999.805096
Filename
805096