Abstract :
The statistical methods used to collect and analyze fault-recovery data directly affect the credibility of reliability estimates. To provide data on which to base the development of sampling methods and parameter estimation techniques, pin-level fault injection was conducted on the FTMP computer. Detection time was chosen for statistical analysis because it accounted for most of the variation in total recovery time. Stuck-at-zero, stuck-at-one, and inverted faults were injected on each of six pins, yielding 18 data sets. The data sets fell into groups of detection behavior; however, none of the factors that were varied in the experiment (fault type, pin, chip, or board) accounted for the groupings. No single distribution was shown to be the best fit to all the data sets; of greater importance, the exponential distribution was a bad fit to every data set. This refutes a common assumption of reliability modeling that detection times are exponentially distributed. These results suggest that stratified random sampling methods and statistically robust parameter estimation techniques are required to characterize fault detection time. Further experimentation is planned to discover the sources of the variation in detection time.
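As an illustration of how the exponential assumption could be checked for one data set, the following is a minimal sketch and not the procedure used in the study. It computes a Kolmogorov-Smirnov distance to an exponential whose rate is estimated from the data, and obtains a p-value by parametric simulation. The function names and the synthetic "grouped" detection times are hypothetical and stand in for real measurements.

```python
import math
import random


def ks_stat_exponential(samples):
    """KS distance between the empirical CDF and an exponential
    distribution whose rate is estimated as 1 / sample mean."""
    xs = sorted(samples)
    n = len(xs)
    rate = n / sum(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # Compare the fitted CDF with the empirical step just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def exponential_fit_pvalue(samples, n_sim=1000, seed=0):
    """Monte Carlo p-value for the hypothesis that the samples are exponential.

    Because the rate is estimated from the data, the statistic is compared
    with statistics from simulated exponential samples of the same size
    (the statistic does not depend on the true rate, so rate 1 is used).
    """
    rng = random.Random(seed)
    n = len(samples)
    observed = ks_stat_exponential(samples)
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.expovariate(1.0) for _ in range(n)]
        if ks_stat_exponential(sim) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


def synthetic_grouped_times(n, seed=0):
    """Hypothetical detection times clustered in two groups (made-up values)."""
    rng = random.Random(seed)
    return [rng.uniform(8, 12) if rng.random() < 0.5 else rng.uniform(40, 60)
            for _ in range(n)]


if __name__ == "__main__":
    data = synthetic_grouped_times(200, seed=1)
    print("KS distance:", ks_stat_exponential(data))
    print("p-value:", exponential_fit_pvalue(data, seed=2))
```

A small p-value indicates that the exponential model is a poor description of the sample. Applying such a check to each data set separately, rather than to pooled data, matches the abstract's observation that detection behavior falls into groups.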
Keywords :
Fault detection; Fault diagnosis; Fault tolerance; Fault tolerant systems; High performance computing; Life estimation; Military computing; Sampling methods; Statistical analysis; Statistical distributions; Fault injection; Fault recovery; Parameter estimation; Reliability modeling