DocumentCode :
2257550
Title :
An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events
Author :
Duarte, Elias Procópio, Jr. ; Brawerman, Alessandro ; Albini, Luiz Carlos P
Author_Institution :
Dept. Inf., Fed. Univ. of Parana, Curitiba, Brazil
fYear :
2000
fDate :
2000
Firstpage :
299
Lastpage :
306
Abstract :
The components of a fault-tolerant distributed system must be able to accurately determine which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully-connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms considered a static fault situation, in which an event can only occur after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all of its testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter, which is incremented every time a node changes its state. In this way, each tester may obtain information about a given node from more than one tested node without causing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes run a hierarchical testing strategy, which forms a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester gets diagnostic information about N/2 nodes for a system of N nodes. In spite of the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency compared to other similar approaches, presenting a new option for practical diagnosis implementation.
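The timestamp mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `merge_timestamps` and `is_faulty`, the dictionary representation, and the convention that an even counter means fault-free and an odd counter means faulty are all assumptions made for illustration. The sketch only shows how a tester could combine information from several tested nodes without replacing a newer state with an older one.

```python
from typing import Dict


def merge_timestamps(local: Dict[int, int], received: Dict[int, int]) -> Dict[int, int]:
    """Return a copy of `local` updated with strictly newer counters from `received`.

    Keys are node identifiers; values are state-change counters (timestamps).
    A received counter is accepted only if it is larger than the local one,
    so stale information never overwrites fresher information.
    """
    merged = dict(local)
    for node, counter in received.items():
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged


def is_faulty(counter: int) -> bool:
    """Assumed convention: counters start at 0 (fault-free) and each event
    increments the counter by one, so odd counters denote a faulty node."""
    return counter % 2 == 1


# Hypothetical example: the tester's view and the view reported by a tested node.
local_view = {1: 0, 2: 3, 3: 2}
reported_view = {2: 2, 3: 4, 4: 1}
merged_view = merge_timestamps(local_view, reported_view)
```

In this example the stale report for node 2 (counter 2) is ignored because the tester already holds counter 3, while the newer counter for node 3 and the previously unknown node 4 are accepted.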
Keywords :
distributed algorithms; fault diagnosis; fault tolerant computing; counter; distributed algorithm; distributed hierarchical diagnosis; dynamic events; dynamic fault events; dynamic repair events; event diagnosis; fault-free nodes; fault-tolerant distributed system; faulty component determination; fully-connected networks; hierarchical algorithms; hierarchical testing strategy; hypercube; latency; node state; overhead; timestamp; Adaptive systems; Counting circuits; Delay; Distributed algorithms; Event detection; Fault diagnosis; Fault tolerant systems; Informatics; Local area networks; System testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Systems, 2000. Proceedings. Seventh International Conference on
Conference_Location :
Iwate
ISSN :
1521-9097
Print_ISBN :
0-7695-0568-6
Type :
conf
DOI :
10.1109/ICPADS.2000.857711
Filename :
857711