DocumentCode :
1196379
Title :
Evaluation of Software-Implemented Fault-Tolerance (SIFT) Approach in Gracefully Degradable Multi-Computer Systems
Author :
Avresky, Dimiter R. ; Geoghegan, Sean J. ; Varoglu, Yavuz
Author_Institution :
Dept. of Electr. & Comput. Eng., Northeastern Univ., Boston, MA
Volume :
55
Issue :
3
fYear :
2006
Firstpage :
451
Lastpage :
457
Abstract :
This paper presents an analytical method for evaluating the reliability improvement for any size of multi-computer system based on Software-Implemented Fault-Tolerance (SIFT). The method is based on the equivalent failure rate Gamma, the single node failure rate lambda, the number of nodes in the system, N, the repair rate mu, the fault coverage factor c, the reconfiguration rate delta, and the percentage of blocking faults b1 and b2. The impact of these parameters on the reliability improvement has been evaluated for a gracefully degradable multi-computer system using our proposed analytical technique based on Markov chains. To validate our approach, we used the SIFT method, which implements error detection at the node level, combined with a fast reconfiguration algorithm for avoiding faulty nodes. It is worth noting that the proposed method is applicable to any multi-computer system topology. The evaluation work presented in this paper focuses on the combination of analytical and experimental approaches, and more precisely on Markov chains. The SIFT method has been successfully implemented for a multi-computer system, nCube. The time overhead (reconfiguration and recomputation time) incurred by the injected fault, and the fault coverage factor c, are experimentally evaluated by means of a parallel version of the Software Object-Oriented Fault-Injection Tool (nSOFIT). The implemented SIFT approach can be used for real-time applications, where the time constraints must be met despite failures in the gracefully degradable multi-computer system.
Keywords :
Markov processes; error detection; object-oriented programming; real-time systems; software fault tolerance; system recovery; Markov chains; SIFT method; error detection; fast reconfiguration algorithm; multi-computer system; real-time application; software object-oriented fault-injection tool; software-implemented fault-tolerance; Central Processing Unit; Degradation; Fault detection; Fault tolerant systems; Object oriented modeling; Operating systems; Real time systems; Software tools; Topology; Upper bound; Fault tolerance; Markov chain; graceful degradation; mean time to failure; multi-computers; reconfiguration; reliability improvement;
fLanguage :
English
Journal_Title :
Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9529
Type :
jour
DOI :
10.1109/TR.2006.879663
Filename :
1688080