• DocumentCode
    936402
  • Title

    Distributed diagnosis in dynamic fault environments

  • Author

    Subbiah, Arun ; Blough, Douglas M.

  • Author_Institution
    Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA
  • Volume
    15
  • Issue
    5
  • fYear
    2004
  • fDate
    5/1/2004 12:00:00 AM
  • Firstpage
    453
  • Lastpage
    467
  • Abstract
    The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness is made up of three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay; bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within bounded time after their recovery; and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, in order to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm heartbeatcomplete is presented, and it is proven that this algorithm achieves bounded correctness in fully connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, known as algorithm forwardheartbeat, is also presented. Forwardheartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.
  • Keywords
    computational complexity; distributed processing; fault diagnosis; fault tolerant computing; minimisation; synchronisation; system recovery; bounded correctness; bounded delay; bounded diagnostic latency; bounded start-up time; distributed diagnosis; dynamic fault environments; fault tolerance; forwardheartbeat algorithm; heartbeatcomplete algorithm; synchronous systems; system recovery; Delay effects; Fault diagnosis; Fault tolerant systems; Performance evaluation; System testing; Topology;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2004.1278102
  • Filename
    1278102