Author_Institution :
Massachusetts Univ., Amherst, MA, USA
Abstract :
A processor is any self-contained computer of at least personal-computer capability. The paper explores how much the processor mean time-to-failure can be improved by replacing it with an N-processor module, where each processor in the module consists of a copy of the original processor augmented with a communication protocol unit. The copy of the original processor is faulty with probability pc, and the protocol unit is faulty with probability p. The asynchronous N-processor module uses a Byzantine agreement (F-ID-P) algorithm to identify which of its processors disagreed with a module consensus. The identified processors are presumed faulty, and the module replaces them with duplicates from a set of standbys. The F-ID-P algorithm is a modification of Bracha's, which guarantees that in a module of 3t+1 processors, up to t faults can be identified by at least t+1 nonfaulty processors. The module fails if faults in more than t of its processors prevent it from: 1) obtaining a correct consensus, or 2) executing the algorithm. The F-ID-P algorithm departs from Bracha's by using a random scheduler of message delays instead of an adversary scheduler. Simulation showed that the F-ID-P algorithm almost always correctly identified all of a module's faulty processors if more than half of the module's processors were nonfaulty. Thus the F-ID-P algorithm was about 3/2 times as fault tolerant as guaranteed. Also, compared to a single processor's mean number of decisions to failure, the F-ID-P module ranged from 841 times better when N=37 down to 5.1 times better when N=10.
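The fault thresholds quoted in the abstract can be illustrated with a short sketch. This is not the paper's simulation: the function names are hypothetical, faults are assumed to be independent across processors and between a processor copy and its protocol unit, and standby replacement is not modeled. It only compares the guaranteed bound (t faults in a module of 3t+1 processors) with the majority bound observed in simulation, and gives a binomial estimate of the chance that a module exceeds a given fault count.

```python
from math import comb


def guaranteed_faults(n: int) -> int:
    """Largest t with 3t + 1 <= n (the guaranteed bound quoted in the abstract)."""
    return (n - 1) // 3


def majority_faults(n: int) -> int:
    """Largest fault count that still leaves more than half of n processors nonfaulty."""
    return (n - 1) // 2


def processor_fault_prob(pc: float, p: float) -> float:
    """Probability that a processor is faulty.

    Assumption (not stated in the abstract): the processor copy (pc) and
    its protocol unit (p) fail independently, and either fault makes the
    whole processor faulty.
    """
    return 1.0 - (1.0 - pc) * (1.0 - p)


def prob_more_than(n: int, k: int, q: float) -> float:
    """P(more than k of n processors are faulty), assuming independent faults
    that each occur with probability q (a binomial tail)."""
    return sum(comb(n, i) * q**i * (1.0 - q) ** (n - i) for i in range(k + 1, n + 1))


if __name__ == "__main__":
    for n in (10, 37):
        t = guaranteed_faults(n)
        m = majority_faults(n)
        print(f"N={n}: guaranteed t={t}, majority bound={m}, ratio={m / t:.2f}")
```

For N=37 the guaranteed bound is 12 faults and the majority bound is 18, a ratio of 3/2; for N=10 the bounds are 3 and 4. Passing either bound as k to prob_more_than, with q taken from processor_fault_prob, gives a rough per-decision failure estimate under the independence assumption.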
Keywords :
failure analysis; fault tolerant computing; probability; protocols; redundancy; reliability; Byzantine agreement algorithm; F-ID-P algorithm; asynchronous N-processor module; communication protocol unit; fault tolerance; mean time-to-failure; message delays; personal-computer; random scheduler; self-contained computer; self-repairing processor modules; Broadcasting; Computer networks; Delay effects; Fault diagnosis; Fault tolerance; Military computing; Processor scheduling; Protocols; Redundancy; Scheduling algorithm