Title :
Algorithm-based fault location and recovery for matrix computations
Author :
Roy-Chowdhury, A. ; Banerjee, P.
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Abstract :
Previous algorithm-based methods for developing reliable versions of numerical algorithms have mostly concerned themselves with error detection. A truly fault tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In our paper, we discuss in detail fault tolerant version of a matrix multiplication algorithm. The ideas developed in the derivation of the fault-tolerant matrix multiplication algorithms may be used to derive fault-tolerant versions of other numerical algorithms. We outline how two other numerical algorithms, QR factorization and Gaussian Elimination may be made fault-tolerant using our approach. Our fault model assumes that a faulty processor can corrupt all the data it possesses. We present error coverage and overhead results for the single faulty processor case for fault-locating and fault-tolerant versions of three numerical algorithms on an Intel iPSC/2 hypercube multicomputer.<>
Keywords :
fault tolerant computing; matrix algebra; parallel algorithms; system recovery; Gaussian Elimination; Intel iPSC/2 hypercube multicomputer; QR factorization; algorithm-based methods; error detection; fault tolerant algorithm; matrix computations; matrix multiplication algorithm; numerical algorithms; parallel processing environment; reliable versions; Algorithm design and analysis; Contracts; Costs; Encoding; Error correction; Fault detection; Fault location; Fault tolerance; Hardware; Parallel algorithms;
Conference_Titel :
Fault-Tolerant Computing, 1994. FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on
Conference_Location :
Austin, TX, USA
Print_ISBN :
0-8186-5520-8
DOI :
10.1109/FTCS.1994.315659