Title :
Fault-tolerant distributed simulation
Author :
Damani, Om P. ; Garg, Vijay K.
Author_Institution :
Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
Abstract :
In traditional distributed simulation schemes, the entire simulation must be restarted if any of the participating logical processes (LPs) crashes. This is highly undesirable for long-running simulations, so some form of fault tolerance is required to minimize the wasted computation. A rollback-based optimistic fault tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation. We make use of the novel insight that a failure can be modeled as a straggler event whose receive time equals the virtual time of the last checkpoint saved on stable storage. This saves implementation effort and reduces overhead. We define stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made to existing GVT algorithms to compute SGVT. Our use of transitive dependency tracking eliminates antimessages. LPs are grouped into clusters to minimize stable storage access time.
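The straggler analogy in the abstract can be illustrated with a minimal Python sketch. All names here (`LP`, `execute`, `rollback`, `crash_and_recover`, `sgvt`) are hypothetical and not taken from the paper. The sketch assumes non-negative integer virtual times, a toy state (a running sum of payloads), and no re-execution of rolled-back events. The `sgvt` function is only one plausible reading of the definition given above (the minimum over stable checkpoint times and in-transit message timestamps); the abstract does not spell out the actual change made to GVT algorithms.

```python
from dataclasses import dataclass, field

@dataclass
class LP:
    """Toy logical process (hypothetical; not the paper's data structure)."""
    lvt: int = 0      # local virtual time
    state: int = 0    # toy state: sum of payloads executed so far
    # Volatile (virtual time, state) snapshots; lost on a crash.
    saved: list = field(default_factory=lambda: [(0, 0)])
    # Last checkpoint written to stable storage; survives a crash.
    stable: tuple = (0, 0)

    def execute(self, recv_time, payload):
        if recv_time < self.lvt:      # straggler: arrives in the LP's past
            self.rollback(recv_time)
        self.lvt = recv_time
        self.state += payload
        self.saved.append((self.lvt, self.state))

    def checkpoint(self):
        # Assumption: the current state is written to stable storage as is.
        self.stable = (self.lvt, self.state)

    def rollback(self, recv_time):
        # Restore the latest snapshot whose virtual time is <= recv_time.
        while self.saved[-1][0] > recv_time:
            self.saved.pop()
        self.lvt, self.state = self.saved[-1]

    def crash_and_recover(self):
        # A crash wipes volatile snapshots; recovery reuses the ordinary
        # rollback path, with the "straggler" receive time set to the
        # virtual time of the last stable checkpoint.
        self.saved = [self.stable]
        self.rollback(self.stable[0])

def sgvt(lps, in_transit=()):
    # Assumed reading: no state below the minimum of all stable checkpoint
    # times and in-transit message timestamps can be rolled back.
    return min([lp.stable[0] for lp in lps] + list(in_transit))

lp = LP()
lp.execute(5, 1)
lp.checkpoint()            # stable checkpoint at virtual time 5
lp.execute(10, 2)
lp.execute(15, 4)
lp.execute(12, 8)          # ordinary straggler: rolls back to time 10 first
lp.crash_and_recover()     # failure: rolls back to the checkpoint at time 5
```

In this sketch the crash handler adds no recovery logic of its own beyond reloading the checkpoint, which is the kind of implementation saving the abstract attributes to treating a failure as a straggler.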
Keywords :
digital simulation; distributed algorithms; fault tolerant computing; system recovery; GVT algorithms; SGVT; antimessages; checkpoints; fault tolerant distributed simulation; logical processes; long running simulations; optimistic distributed simulation scheme; receive time; rollback based optimistic fault tolerance scheme; rollback recovery schemes; saved checkpoints; stable global virtual time; stable storage; stable storage access time; straggler event; transitive dependency tracking; virtual time; wasted computation; Application software; Clustering algorithms; Computational modeling; Computer bugs; Computer crashes; Distributed computing; Fault tolerance; Hardware; Operating systems; Software systems;
Conference_Title :
Parallel and Distributed Simulation, 1998. PADS 98. Proceedings. Twelfth Workshop on
Conference_Location :
Banff, Alta.
Print_ISBN :
0-8186-8457-7
DOI :
10.1109/PADS.1998.685268