Title :
A fault tolerance infrastructure for dependable computing with high-performance COTS components
Author_Institution :
A. Avizienis & Assoc. Inc., Santa Monica, CA, USA
Abstract :
The failure rates of current COTS processors have dropped to 100 FITs (failures per 10^9 hours), indicating a potential MTTF of over 1100 years. However, our recent study of Intel P6 family processors has shown that they have very limited error detection and recovery capabilities and contain numerous design faults ("errata"). Other limitations are susceptibility to transient faults and uncertainty about "wearout" that could increase the failure rate over time. Because of these limitations, an external fault tolerance infrastructure is needed to assure the dependability of a system with such COTS components. The paper describes a fault-tolerant "infrastructure" system of fault tolerance functions that makes possible the use of low-coverage COTS processors in a fault-tolerant, self-repairing system. The custom hardware supports transient recovery, design fault tolerance, and self-repair by sparing and replacement. Fault tolerance functions are implemented by four types of hardware processors of low complexity that are themselves fault-tolerant. High error detection coverage, including coverage of design faults, is attained by diversity and replication.
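The MTTF figure quoted in the abstract follows from the FIT rate by simple arithmetic. The sketch below is an illustration only, not taken from the paper: it assumes a constant failure rate (so MTTF is the reciprocal of the rate) and a 365.25-day year, and the function name `fit_to_mttf_years` is hypothetical.

```python
# Assumption: one year is 365.25 days (Julian year).
HOURS_PER_YEAR = 24 * 365.25


def fit_to_mttf_years(fits: float) -> float:
    """Convert a failure rate in FITs (failures per 1e9 hours) to MTTF in years.

    Assumes a constant failure rate, so MTTF = 1 / rate.
    """
    if fits <= 0:
        raise ValueError("FIT rate must be positive")
    mttf_hours = 1e9 / fits  # 100 FITs gives 1e7 hours
    return mttf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 100 FITs is the rate quoted in the abstract.
    print(f"{fit_to_mttf_years(100):.0f} years")
```

Under these assumptions, 100 FITs corresponds to 10^7 hours, or roughly 1141 years, consistent with the "over 1100 years" stated above. The result describes permanent hardware failures only; it does not account for the transient faults, design faults, or wearout that the abstract lists as limitations.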
Keywords :
"Fault tolerance","Fault detection","Hardware","Uncertainty","Error correction","Software design","Semiconductor devices","Logic devices","Logic design","Environmental factors"
Conference_Title :
Dependable Systems and Networks, 2000. DSN 2000. Proceedings International Conference on
Print_ISBN :
0-7695-0707-7
DOI :
10.1109/ICDSN.2000.857581