DocumentCode :
1439591
Title :
Algorithm-based fault tolerance on a hypercube multiprocessor
Author :
Banerjee, Prithviraj ; Rahmeh, Joe T. ; Stunkel, Craig ; Nair, V.S. ; Roy, Kaushik ; Balasubramanian, Vijay ; Abraham, Jacob A.
Author_Institution :
Dept. of Electr. Eng., Illinois Univ., Urbana, IL, USA
Volume :
39
Issue :
9
fYear :
1990
fDate :
9/1/1990 12:00:00 AM
Firstpage :
1132
Lastpage :
1145
Abstract :
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done of the error coverage of the system-level error detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate and replace faulty processors with spare processors.
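As a rough illustration of the kind of algorithm-based error detection the abstract mentions for matrix multiplication, the sketch below uses the classic row/column checksum encoding with a floating-point tolerance. It is not taken from the paper: the function names, the relative tolerance `rel_tol`, and the single-process setting (no hypercube, no message passing) are all assumptions made for illustration.

```python
# Hypothetical single-process sketch of checksum-encoded matrix multiplication.
# Names and the tolerance value are illustrative assumptions, not the paper's scheme.

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def checksum_errors(c_full, rel_tol=1e-9):
    """Return (bad_rows, bad_cols) whose data sums disagree with their checksums.

    A relative tolerance is used because finite-precision arithmetic means the
    recomputed sum and the stored checksum rarely match bit for bit.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1

    def differs(values, check):
        scale = max(1.0, sum(abs(v) for v in values))
        return abs(sum(values) - check) > rel_tol * scale

    bad_rows = [i for i in range(n) if differs(c_full[i][:m], c_full[i][m])]
    bad_cols = [j for j in range(m)
                if differs([c_full[i][j] for i in range(n)], c_full[n][j])]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 1.5], [2.5, 3.5]]
# The product of the encoded inputs carries both row and column checksums.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = checksum_errors(c_full)

c_full[1][0] += 1.0  # simulate a faulty result element
faulty = checksum_errors(c_full)
```

In this sketch a single corrupted element shows up as one inconsistent row and one inconsistent column, and their intersection locates it. Choosing `rel_tol` is the delicate part: too tight and rounding noise raises false alarms, too loose and small errors go undetected.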
Keywords :
fault tolerant computing; multiprocessing systems; parallel architectures; Gaussian elimination; Intel iPSC hypercube; error detection; fast Fourier transform; fault tolerance; faulty processors; hypercube multiprocessor; matrix multiplication; multiprocessor architecture; Computer architecture; Computer errors; Costs; Fault detection; Fault diagnosis; Fault tolerance; Hypercubes; Jacobian matrices; Joining processes; Parallel architectures;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/12.57055
Filename :
57055