• DocumentCode
    3321656
  • Title

    Logging and recovery in adaptive software distributed shared memory systems

  • Author

    Kongmunvattana, Angkul ; Tzeng, Nian-Feng

  • Author_Institution
    Center for Adv. Comput. Studies, Univ. of Southwestern Louisiana, Lafayette, LA, USA
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    202
  • Lastpage
    211
  • Abstract
    Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstract (i.e., a coherent global address space) to programmers. As in any distributed system, however; the probability of software DSM failures increases as the system size grows. This paper presents a new efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing since it speeds up the recovery process and eliminates the unbounded rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only unrecoverable data (which cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: Our protocol increases the execution time slightly by 2% to 10% during failure-free execution, while the ML protocol lengthens the execution time by many folds due to its larger log size and higher number of messages logged. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% under parallel application examined
  • Keywords
    distributed shared memory systems; system recovery; adaptive logging; adaptive software; adaptive software DSM; coherent global address space; distributed shared memory systems; message-passing machines; shared memory abstract; workstation clusters; Access protocols; Application software; Coherence; Distributed computing; Electronic switching systems; Parallel programming; Programming profession; Software performance; Sun; Workstations;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE Symposium on
  • Conference_Location
    Lausanne
  • ISSN
    1060-9857
  • Print_ISBN
    0-7695-0290-3
  • Type

    conf

  • DOI
    10.1109/RELDIS.1999.805096
  • Filename
    805096