Title :
Optimal diagnosis of heterogeneous systems with random faults
Author_Institution :
Dept. d'Inf., Quebec Univ., Hull, Que., Canada
fDate :
3/1/1998 12:00:00 AM
Abstract :
We consider the problem of fault diagnosis in multiprocessor systems. Processors perform tests on one another; fault-free testers correctly identify the fault status of tested processors, while faulty testers can give arbitrary test results. Processors fail with arbitrary probabilities and all failures are independent. The goal is to correctly identify the status of all processors, based on the set of test results. A diagnosis algorithm is optimal if it has the highest probability of correctness (reliability) among all (deterministic) diagnosis algorithms. We give a fast diagnosis algorithm and prove its optimality for arbitrary values of failure probabilities. This is the first time that an optimal diagnosis is given for systems without any assumptions on the behavior of faulty processors or on the values of failure probabilities. We also investigate locally optimal diagnosis algorithms: for any set of test results, they return the most probable configuration of faulty and fault-free processors that could yield it. We show a fast diagnosis algorithm that is always locally optimal. If all processors have failure probabilities smaller than ½, a locally optimal diagnosis is proved to be optimal. However, if some processors have failure probabilities exceeding ½, a locally optimal diagnosis need not have the highest reliability. We even give examples in which its reliability becomes arbitrarily small as the number of processors increases, while the optimal reliability remains constant.
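The notion of a locally optimal diagnosis described in the abstract can be illustrated with a minimal brute-force sketch. This is not the fast algorithm of the paper, which the abstract does not spell out; it enumerates all 2^n configurations and is only practical for very small systems. The names `is_consistent`, `config_probability` and `locally_optimal_diagnosis`, and the encoding of test results as a dictionary, are assumptions made for illustration. A configuration is treated as one that "could yield" the test results when every fault-free tester reports the true status of the processor it tests; faulty testers are unconstrained.

```python
from itertools import product
from math import prod


def is_consistent(faulty, tests):
    """Check whether a configuration could have produced the test results.

    faulty: tuple of bools, faulty[i] is True if processor i is faulty.
    tests:  dict mapping (tester, tested) -> True if the tester reported
            the tested processor as faulty (hypothetical encoding).
    A faulty tester may report anything, so only fault-free testers
    constrain the configuration.
    """
    return all(
        faulty[tester] or reported == faulty[tested]
        for (tester, tested), reported in tests.items()
    )


def config_probability(faulty, fail_prob):
    """Probability of a configuration under independent failures."""
    return prod(p if f else 1.0 - p for f, p in zip(faulty, fail_prob))


def locally_optimal_diagnosis(fail_prob, tests):
    """Return the most probable configuration consistent with the tests.

    Exhaustive search over all 2^n configurations (illustration only).
    The all-faulty configuration is always consistent, so a result exists.
    """
    n = len(fail_prob)
    candidates = (
        c for c in product((False, True), repeat=n) if is_consistent(c, tests)
    )
    return max(candidates, key=lambda c: config_probability(c, fail_prob))


# Three processors, each failing with probability 0.1.
# Processor 0 reports 1 as fault-free, 1 reports 2 as faulty,
# and 2 reports 0 as faulty.
diagnosis = locally_optimal_diagnosis(
    [0.1, 0.1, 0.1],
    {(0, 1): False, (1, 2): True, (2, 0): True},
)
print(diagnosis)  # (False, False, True): only processor 2 is diagnosed faulty
```

In this small example all failure probabilities are below ½, the regime in which the abstract states that a locally optimal diagnosis is also optimal. When some probabilities exceed ½, the abstract notes that this choice need not have the highest reliability.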
Keywords :
deterministic algorithms; fault diagnosis; fault tolerant computing; multiprocessing systems; arbitrary test results; deterministic diagnosis algorithms; failure probabilities; heterogeneous systems; multiprocessor systems; optimal diagnosis; optimality; random faults; Bibliographies; Fault tolerant systems; Performance evaluation; System testing
Journal_Title :
Computers, IEEE Transactions on