DocumentCode :
1686362
Title :
Adaptive fault tolerance
Author :
Goldberg, Jack ; Greenberg, Ira ; Lawrence, Thomas F.
Author_Institution :
SRI Int., Menlo Park, CA, USA
fYear :
1993
fDate :
10/6/1993 12:00:00 AM
Firstpage :
127
Lastpage :
132
Abstract :
The goal of adaptive fault tolerance (AFT) is to expand the envelope of dependable system operation in distributed, real-time systems. Such systems often experience substantial run-time changes in the types and distributions of faults, in the availability of resources, in data distribution, and in users' requirements for dependability and performance. Preliminary examples, such as Adaptable Distributed Recovery Blocks (Kim) and distributed crash recovery, illustrate how adaptive fault tolerance can provide useful tradeoffs among service properties such as error-recovery latency, throughput, and precision, over a wide range of operating conditions. A general methodology for AFT system design must address issues of (1) rapid, incremental diagnosis/estimation of environmental and internal state, (2) safe and effective control, and (3) efficient, parametric or multimode fault-tolerant implementations. A major challenge is to achieve the additional flexibility without excessive complexity, both for performance and reliability concerns. Reflective architecture, a form of meta-design, is an attractive framework for AFT system design and for adaptive systems in general. It provides for the monitoring and redefinition of system behavior in a hierarchical manner that may be integrated with conventional "uses-based hierarchical design
Keywords :
distributed processing; fault tolerant computing; real-time systems; Adaptable Distributed Recovery Blocks; adaptive fault tolerance; adaptive systems; complexity; dependable system operation; distributed crash recovery; distributed systems; error-recovery latency; performance; precision; reliability; throughput; Availability; Computer crashes; Delay; Fault diagnosis; Fault tolerance; Fault tolerant systems; Real time systems; Runtime; State estimation; Throughput
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, 1993
Conference_Location :
Princeton, NJ
Print_ISBN :
0-8186-5250-0
Type :
conf
DOI :
10.1109/APADS.1993.588861
Filename :
588861