Title :
Reduced overhead logging for rollback recovery in distributed shared memory
Author :
Suri, G. ; Janssens, B. ; Fuchs, W.K.
Author_Institution :
AT&T Bell Labs., Murray Hill, NJ, USA
Abstract :
Rollback techniques that use message logging and deterministic replay can be used in parallel systems to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot directly apply message-passing logging techniques because they use inherently nondeterministic asynchronous communication. This paper presents new logging schemes that reduce the typically high overhead for logging in DSM. Our algorithm for sequentially consistent systems tracks rather than logs accesses to shared memory. In an extension of this method to lazy release consistency, the per-access overhead of tracking has been completely eliminated. Measurements with parallel applications show a significant reduction in failure-free overhead.
Keywords :
data loggers; distributed memory systems; fault tolerant computing; shared memory systems; system recovery; deterministic replay; distributed shared memory; failed node recovery; failure-free overhead; lazy release consistency; message logging; nondeterministic asynchronous communication; overhead logging; parallel systems; per-access overhead; rollback recovery; sequentially consistent systems; shared memory access tracking; Asynchronous communication; Checkpointing; Computer crashes; Concurrent computing; Distributed computing; Distributed processing; Hardware; Laboratories; Message passing; NASA;
Conference_Title :
Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), 1995, Digest of Papers
Conference_Location :
Pasadena, CA, USA
Print_ISBN :
0-8186-7079-7
DOI :
10.1109/FTCS.1995.466971