Title :
Progressive retry for software error recovery in distributed systems
Author :
Wang, Yi-Min ; Huang, Yennun ; Fuchs, W. Kent
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Abstract :
The paper describes an execution-retry method that bypasses software faults through checkpointing, rollback, message replay, and message reordering. The authors demonstrate how rollback techniques, previously developed for recovery from transient hardware failures, can also recover from software errors: reordering messages during re-execution can steer the program around the fault. When a retry fails, the approach intentionally increases both the degree of nondeterminism and the scope of rollback for the next attempt. Examples from experience with telecommunications software systems illustrate the benefits of the scheme.
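The retry idea summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' algorithm: the names `apply_message`, `replay`, and `progressive_retry`, the toy fault (message "B" delivered immediately after "A"), and the assumption that logged messages may be freely reordered are all hypothetical and chosen for illustration only. Each retry first replays the logged messages in their original order, then tries reorderings, and rolls back to an earlier checkpoint only when every ordering in the current scope has failed.

```python
import copy
from itertools import permutations


class SoftwareFault(Exception):
    """Raised by the toy handler when its latent bug is triggered."""


def apply_message(state, msg):
    # Hypothetical handler with a latent bug: "B" right after "A" crashes.
    if state and state[-1] == "A" and msg == "B":
        raise SoftwareFault("B delivered immediately after A")
    state.append(msg)


def replay(checkpoint, messages):
    # Restore a copy of the checkpointed state and re-deliver the messages.
    state = copy.deepcopy(checkpoint)
    for msg in messages:
        apply_message(state, msg)
    return state


def progressive_retry(checkpoints, logs):
    """checkpoints[i] is the state saved before logs[i] was delivered.

    Returns (rollback_depth, delivery_order, recovered_state).
    Assumption: messages in the rollback window may be reordered freely;
    a real system would have to respect causal dependencies.
    """
    for depth in range(1, len(checkpoints) + 1):
        start = len(checkpoints) - depth  # roll back one checkpoint further
        window = [m for log in logs[start:] for m in log]
        # permutations() yields the original order first (plain replay),
        # then the reorderings, i.e. increasing nondeterminism.
        for order in permutations(window):
            try:
                return depth, list(order), replay(checkpoints[start], order)
            except SoftwareFault:
                continue
    raise SoftwareFault("all retries failed")
```

For example, with `checkpoints = [[], ["A"]]` and `logs = [["A"], ["B"]]`, no ordering works from the latest checkpoint, so the sketch rolls back to the first checkpoint and recovers by delivering "B" before "A".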
Keywords :
software fault tolerance; checkpointing; distributed systems; execution retry; message reordering; nondeterminism; progressive retry; rollback; software error recovery; telecommunications software systems; transient hardware failure recovery; Computer errors; Contracts; Costs; Hardware; NASA; Protocols; Runtime; Software systems; Testing
Conference_Title :
The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), 1993, Digest of Papers
Conference_Location :
Toulouse, France
Print_ISBN :
0-8186-3680-7
DOI :
10.1109/FTCS.1993.627317