• DocumentCode
    308580
  • Title

    Fault recovery for distributed shared memory systems

  • Author

    Dieter, William R. ; Lumpp, James E., Jr.

  • Author_Institution
    Dept. of Electr. Eng., Kentucky Univ., Lexington, KY, USA
  • Volume
    2
  • fYear
    1997
  • fDate
    1-8 Feb 1997
  • Firstpage
    525
  • Abstract
    Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via “checkpointing” techniques that allow applications to “roll back” to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
  • Keywords
    distributed memory systems; fault tolerant computing; message passing; probability; shared memory systems; DSM systems; classification; distributed shared memory systems; fault recovery; fault tolerance; high-performance computing; message passing architectures; network technology; price/performance of workstations; probability; Checkpointing; Computer networks; Fault tolerance; High performance computing; Large-scale systems; Message passing; Parallel architectures; Programming profession; Scalability; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Aerospace Conference, 1997. Proceedings., IEEE
  • Conference_Location
    Snowmass at Aspen, CO
  • Print_ISBN
    0-7803-3741-7
  • Type

    conf

  • DOI
    10.1109/AERO.1997.577998
  • Filename
    577998