DocumentCode
1686362
Title
Adaptive fault tolerance
Author
Goldberg, Jack ; Greenberg, Ira ; Lawrence, Thomas F.
Author_Institution
SRI Int., Menlo Park, CA, USA
fYear
1993
fDate
10/6/1993 12:00:00 AM
Firstpage
127
Lastpage
132
Abstract
The goal of adaptive fault tolerance (AFT) is to expand the envelope of dependable system operation in distributed, real-time systems. Such systems often experience substantial run-time changes in the types and distributions of faults, in the availability of resources, in data distribution, and in users' requirements for dependability and performance. Preliminary examples, such as Adaptable Distributed Recovery Blocks (Kim) and distributed crash recovery, illustrate how adaptive fault tolerance can provide useful tradeoffs among service properties such as error-recovery latency, throughput, and precision, over a wide range of operating conditions. A general methodology for AFT system design must address issues of (1) rapid, incremental diagnosis/estimation of environmental and internal state, (2) safe and effective control, and (3) efficient, parametric or multimode fault-tolerant implementations. A major challenge is to achieve the additional flexibility without excessive complexity, both for performance and reliability concerns. Reflective architecture, a form of meta-design, is an attractive framework for AFT system design and for adaptive systems in general. It provides for the monitoring and redefinition of system behavior in a hierarchical manner that may be integrated with conventional "uses-based hierarchical design
Keywords
distributed processing; fault tolerant computing; real-time systems; Adaptable Distributed Recovery Blocks; adaptive fault tolerance; adaptive systems; complexity; dependable system operation; distributed crash recovery; distributed systems; error-recovery latency; performance; precision; real-time systems; reliability; throughput; Availability; Computer crashes; Delay; Fault diagnosis; Fault tolerance; Fault tolerant systems; Real time systems; Runtime; State estimation; Throughput;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advances in Parallel and Distributed Systems, 1993., Proceedings of the IEEE Workshop on
Conference_Location
Princeton, NJ
Print_ISBN
0-8186-5250-0
Type
conf
DOI
10.1109/APADS.1993.588861
Filename
588861