Title :
Design of Algorithm-Based Fault Tolerant Systems with In-System Checks
Author :
Yajnik, Shalini ; Jha, Niraj K.
Author_Institution :
Princeton University
Abstract :
To improve the reliability of compute-intensive applications run on multiprocessor architectures, fault tolerance is introduced into the system with on-line detection and location of faults. This can be achieved by a low-cost scheme, called algorithm-based fault tolerance (ABFT), which encodes data at the system level and modifies the algorithm to operate on the encoded data. The resulting encoded output data is then verified by a set of checks. In this paper we present an extended model for representing and designing ABFT systems. The model takes into consideration the processors evaluating the checks. We propose a design method which considers the processors computing the checks to be a part of the ABFT system and guarantees concurrent error detection even in the presence of faults in these processors, unlike most methods presented earlier.
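As a rough illustration of the general ABFT idea summarized in the abstract (encode the data, run the algorithm on the encoded data, then check the encoded output), the sketch below uses the classic checksum-encoded matrix multiplication. It is not the design method proposed in the paper: all function names are hypothetical, and the check is assumed to run on a fault-free host, which is exactly the assumption the paper sets out to remove.

```python
# Minimal ABFT sketch: checksum-encoded matrix multiplication.
# Assumption: integer matrices as lists of lists; names are illustrative only.

def column_checksum(a):
    """Encode A by appending a row that holds the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Encode B by appending to each row the sum of that row."""
    return [row + [sum(row)] for row in b]

def matmul(a, b):
    """Plain matrix product; operates unchanged on the encoded operands."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def check(c_full):
    """Return (bad_rows, bad_cols) whose checksums disagree with the data.

    Assumption: this check itself executes without faults.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(column_checksum(a), row_checksum(b))
clean_result = check(c_full)

# Inject a single erroneous element to mimic a faulty processor.
c_faulty = [row[:] for row in c_full]
c_faulty[0][1] += 1
faulty_result = check(c_faulty)
```

In this sketch a single corrupted data element is detected and located by the intersection of the failing row and column checks. If the processor evaluating `check` were itself faulty, this simple scheme would give no guarantee, which is the situation the paper's extended model addresses.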
Keywords :
Algorithm design and analysis; Computer applications; Concurrent computing; Design methodology; Electrical fault detection; Fault detection; Fault tolerant systems; Multiprocessing systems; Parallel processing; Process design;
Conference_Title :
Parallel Processing, 1993. ICPP 1993. International Conference on
Conference_Location :
Syracuse, NY, USA
Print_ISBN :
0-8493-8983-6
DOI :
10.1109/ICPP.1993.70