• DocumentCode
    1179410
  • Title

    A time redundancy approach to TMR failures using fault-state likelihoods

  • Author

    Shin, Kang G. ; Kim, Hagbae

  • Author_Institution
    Real-Time Comput. Lab., Michigan Univ., Ann Arbor, MI, USA
  • Volume
    43
  • Issue
    10
  • fYear
    1994
  • fDate
    10/1/1994 12:00:00 AM
  • Firstpage
    1151
  • Lastpage
    1162
  • Abstract
    Failure to establish a majority among the processing modules in a triple modular redundant (TMR) system, called a TMR failure, is detected by using two voters and a disagreement detector. Assuming that no more than one module becomes permanently faulty during the execution of a task, Re-execution of the task on the Same HardWare (RSHW) upon detection of a TMR failure becomes a cost-effective recovery method, because 1) the TMR system can mask the effects of one faulty module while RSHW can recover from nonpermanent faults, and 2) system reconfiguration (Replace the faulty HardWare, reload, and Restart, or RHWR) is expensive both in time and hardware. We propose an adaptive recovery method for TMR failures by "optimally" choosing either RSHW or RHWR based on the estimation of the costs involved. We apply Bayes' theorem to update the likelihoods of all possible states in the TMR system with each voting result. Upon detection of a TMR failure, the expected cost of RSHW is derived with these likelihoods and then compared with that of RHWR. RSHW will continue either until it recovers from the TMR failure or until the expected cost of RSHW becomes larger than that of RHWR. As the number of unsuccessful RSHWs increases, the probability of permanent fault(s) having caused the TMR failure will increase, which will, in turn, increase the cost of RSHW. Our simulation results show that the proposed method outperforms the conventional reconfiguration method using only RHWR under various conditions.
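    The decision rule summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the two-state belief, the failure probabilities in `P_FAIL`, the cost values, and the one-step expected-cost comparison are all hypothetical simplifications chosen only to show how a Bayes update after each failed re-execution shifts the choice from RSHW to RHWR.

```python
# Hypothetical P(re-execution ends in another TMR failure | fault state).
# These numbers are illustrative assumptions, not values from the paper.
P_FAIL = {"nonpermanent": 0.2, "permanent": 1.0}


def bayes_update(belief, likelihood):
    """Return the posterior over fault states after observing one TMR failure."""
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def choose_action(belief, cost_reexec, cost_rhwr):
    """Pick RSHW while its (simplified) expected cost is below the cost of RHWR.

    Assumed cost model: one re-execution costs cost_reexec, and if it fails
    again the system falls back to RHWR at cost cost_rhwr.
    """
    p_fail_again = sum(belief[s] * P_FAIL[s] for s in belief)
    expected_rshw = cost_reexec + p_fail_again * cost_rhwr
    return "RSHW" if expected_rshw < cost_rhwr else "RHWR"


def retries_before_rhwr(belief, cost_reexec, cost_rhwr, max_retries=20):
    """Count RSHW attempts in the worst case where every re-execution fails."""
    retries = 0
    while retries < max_retries and choose_action(belief, cost_reexec, cost_rhwr) == "RSHW":
        retries += 1
        # Another TMR failure was observed, so the permanent-fault
        # likelihood grows and the expected cost of RSHW rises with it.
        belief = bayes_update(belief, P_FAIL)
    return retries


if __name__ == "__main__":
    initial = {"nonpermanent": 0.9, "permanent": 0.1}  # hypothetical prior
    print(choose_action(initial, cost_reexec=1.0, cost_rhwr=10.0))
    print(retries_before_rhwr(initial, cost_reexec=1.0, cost_rhwr=10.0))
```

    Under these assumed numbers, each unsuccessful re-execution raises the likelihood of a permanent fault, so the expected cost of RSHW eventually exceeds that of RHWR and the sketch switches to reconfiguration.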
  • Keywords
    Bayes methods; digital simulation; fault tolerant computing; redundancy; Bayes theorem; TMR failures; adaptive recovery method; disagreement detector; fault-state likelihoods; processing modules; simulation results; system reconfiguration; time redundancy approach; triple modular redundant system; voters; Clocks; Costs; Detectors; Electrical fault detection; Fault detection; Fault tolerance; Hardware; NASA; Redundancy; Voting
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/12.324541
  • Filename
    324541