Title :
On tolerating faults in naturally redundant algorithms
Author :
Laranjeira, Luiz A. ; Malek, Miroslaw ; Jenevein, R.
Author_Institution :
Dept. of Electr. & Comput, Eng., Texas Univ., Austin, TX, USA
fDate :
30 Sep-2 Oct 1991
Abstract :
A class of algorithms suitable for fault-tolerant execution in multiprocessor systems by exploiting the existing embedded redundancy in the problem variables is characterized. Because of this unique property, no extra computations need be superimposed on the algorithm in order to provide redundancy for fault recovery, as well as fault detection in some cases. A forward recovery scheme is thus used with very low time overhead. The method is applied to the implementation of two iterative algorithms: solution of Laplace equations by Jacobi´s method and the calculation of the invariant distribution of a Markov chain. Experiments show less than 15% performance degradation for significant problem instances in fault-free situations, and as low as 2.43% in some cases. The extra computation time needed for locating and recovering from a detected fault does not exceed the time necessary to execute a single iteration. The fault-detection procedures provide fault coverage close to 100% for faults causing errors that affect the correctness of the computations
Keywords :
Laplace transforms; Markov processes; fault tolerant computing; iterative methods; multiprocessing systems; Jacobi´s method; Laplace equations; Markov chain; embedded redundancy; fault detection; fault recovery; fault tolerant computing; forward recovery scheme; iterative algorithms; multiprocessor systems; naturally redundant algorithms; Degradation; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Iterative algorithms; Jacobian matrices; Laplace equations; Multiprocessing systems; Redundancy;
Conference_Titel :
Reliable Distributed Systems, 1991. Proceedings., Tenth Symposium on
Conference_Location :
Pisa
Print_ISBN :
0-8186-2260-1
DOI :
10.1109/RELDIS.1991.145413