Title :
Algorithm-based fault tolerance on a hypercube multiprocessor
Author :
Banerjee, Prithviraj ; Rahmeh, Joe T. ; Stunkel, Craig ; Nair, V.S. ; Roy, Kaushik ; Balasubramanian, Vijay ; Abraham, Jacob A.
Author_Institution :
Dept. of Electr. Eng., Illinois Univ., Urbana, IL, USA
fDate :
9/1/1990
Abstract :
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow faulty processors to be isolated and replaced with spare processors.
Keywords :
fault tolerant computing; multiprocessing systems; parallel architectures; Gaussian elimination; Intel iPSC hypercube; error detection; fast Fourier transform; fault tolerance; faulty processors; hypercube multiprocessor; matrix multiplication; multiprocessor architecture; Computer architecture; Computer errors; Costs; Fault detection; Fault diagnosis; Fault tolerance; Hypercubes; Jacobian matrices; Joining processes; Parallel architectures
Journal_Title :
Computers, IEEE Transactions on