Title :
Logging and recovery in adaptive software distributed shared memory systems
Author :
Kongmunvattana, Angkul ; Tzeng, Nian-Feng
Author_Institution :
Center for Adv. Comput. Studies, Univ. of Southwestern Louisiana, Lafayette, LA, USA
Abstract :
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstract (i.e., a coherent global address space) to programmers. As in any distributed system, however; the probability of software DSM failures increases as the system size grows. This paper presents a new efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing since it speeds up the recovery process and eliminates the unbounded rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only unrecoverable data (which cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: Our protocol increases the execution time slightly by 2% to 10% during failure-free execution, while the ML protocol lengthens the execution time by many folds due to its larger log size and higher number of messages logged. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% under parallel application examined
Keywords :
distributed shared memory systems; system recovery; adaptive logging; adaptive software; adaptive software DSM; coherent global address space; distributed shared memory systems; message-passing machines; shared memory abstract; workstation clusters; Access protocols; Application software; Coherence; Distributed computing; Electronic switching systems; Parallel programming; Programming profession; Software performance; Sun; Workstations;
Conference_Titel :
Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE Symposium on
Conference_Location :
Lausanne
Print_ISBN :
0-7695-0290-3
DOI :
10.1109/RELDIS.1999.805096