Title :
Fault-tolerance using cache-coherent distributed shared memory systems
Author :
Hecht, D.L. ; Kavi, K.M. ; Gaede, R.K. ; Katsinis, C.
Author_Institution :
Alabama Univ., Huntsville, AL, USA
Abstract :
Describes new protocols that augment traditional cache coherency mechanisms to implement fault tolerance based on recovery blocks and checkpointing. Concurrent processes complicate rollback recovery, since a rollback can potentially lead to a “domino effect” whereby the process is rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One set of such techniques requires communicating processes to synchronize periodically in order to checkpoint a globally consistent state. These schemes can be implemented more naturally on distributed shared memory systems by synchronizing on shared memory. We have developed extensions to well-known cache-coherency methods (e.g. directory-based) to implement checkpointing of consistent states.
Keywords :
cache storage; coherence; distributed shared memory systems; fault tolerant computing; memory protocols; synchronisation; system recovery; cache-coherent distributed shared memory systems; checkpointing; communicating process synchronization; concurrent processes; directory-based cache-coherency methods; domino effect; fault tolerance; globally consistent state; protocols; recovery blocks; rollback recovery; Decision support systems; Fault tolerant systems
Conference_Title :
Parallel Architectures, Algorithms, and Networks, 1999. (I-SPAN '99) Proceedings. Fourth International Symposium on
Conference_Location :
Perth/Fremantle, WA
Print_ISBN :
0-7695-0231-8
DOI :
10.1109/ISPAN.1999.778924