• DocumentCode
    3336144
  • Title

    A recoverable distributed shared memory integrating coherence and recoverability

  • Author

    Kermarrec, A.-M. ; Cabillic, G. ; Gefflaut, A. ; Morin, C. ; Puaut, I.

  • Author_Institution
    IRISA, Campus Univ. de Beaulieu, Rennes, France
  • fYear
    1995
  • fDate
    27-30 June 1995
  • Firstpage
    289
  • Lastpage
    298
  • Abstract
    Large-scale distributed systems are very attractive for the execution of parallel applications requiring a huge computing power. However, their high probability of site failure is unacceptable, especially for long time running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the amount of transferred pages during the checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
  • Keywords
    distributed memory systems; large-scale systems; performance evaluation; shared memory systems; system recovery; Intel Paragon; checkpointing mechanism; coherence; coherence protocol; data replication; large-scale distributed systems; long time running applications; parallel applications; performance evaluation; recoverability; recoverable distributed shared memory; recovery data management; single node failure tolerance; site failure; standard memories; transferred pages; Checkpointing; Concurrent computing; Distributed computing; Hardware; Large scale integration; Proposals; Protocols; Scalability; Stability; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on
  • Conference_Location
    Pasadena, CA, USA
  • Print_ISBN
    0-8186-7079-7
  • Type

    conf

  • DOI
    10.1109/FTCS.1995.466970
  • Filename
    466970