Regional Information Center for Science and Technology - An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events

DocumentCode :

2257550

Title :

An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events

Author :

Duarte, Elias Procópio, Jr. ; Brawerman, Alessandro ; Albini, Luiz Carlos P

Author_Institution :

Dept. Inf., Fed. Univ. of Parana, Curitiba, Brazil

fYear :

2000

fDate :

2000

Firstpage :

299

Lastpage :

306

Abstract :

The components of a fault-tolerant distributed system must be able to determine accurately which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully-connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms considered a static fault situation, in which an event can only occur after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter, which is incremented every time a node changes its state. In this way, each tester may obtain information about a given node from more than one tested node without causing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes run a hierarchical testing strategy, which forms a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester gets diagnostic information about N/2 nodes for a system of N nodes. In spite of the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency compared to other similar approaches, presenting a new option for practical diagnosis implementation.
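
The timestamp mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `NodeView`, its methods, and the convention that an even counter means fault-free and an odd counter means faulty (with every node initially assumed fault-free) are assumptions made here for illustration only. The sketch shows only how a tester might merge diagnostic information from several tested nodes while never replacing a newer state with an older one; the hierarchical testing strategy itself is not modeled.

```python
from dataclasses import dataclass, field

FAULT_FREE = "fault-free"
FAULTY = "faulty"


@dataclass
class NodeView:
    """One node's local view: a state-change counter for every node (hypothetical)."""
    n: int
    timestamps: list = field(default_factory=list)

    def __post_init__(self):
        # Assumption: all nodes start fault-free with counter 0.
        if not self.timestamps:
            self.timestamps = [0] * self.n

    def state(self, node):
        # Assumption: even counter = fault-free, odd counter = faulty.
        return FAULT_FREE if self.timestamps[node] % 2 == 0 else FAULTY

    def record_event(self, node):
        # An event (fault or repair) increments the counter for that node.
        self.timestamps[node] += 1

    def merge(self, other):
        """Adopt only strictly newer entries from a tested node's view.

        Returns the list of node ids whose entry was updated.
        """
        updated = []
        for node, ts in enumerate(other.timestamps):
            if ts > self.timestamps[node]:
                self.timestamps[node] = ts
                updated.append(node)
        return updated


# Usage: node 3 fails and is later repaired. View b saw only the fault,
# view c saw both events. Tester a reads c first, then the staler b.
a, b, c = NodeView(4), NodeView(4), NodeView(4)
b.record_event(3)
c.record_event(3)
c.record_event(3)
a.merge(c)
a.merge(b)  # older information from b is ignored
```

Because the comparison is on counters rather than on arrival order, the tester ends up with the same view regardless of which tested node it consults first.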

Keywords :

distributed algorithms; fault diagnosis; fault tolerant computing; counter; distributed algorithm; distributed hierarchical diagnosis; dynamic events; dynamic fault events; dynamic repair events; event diagnosis; fault-free nodes; fault-tolerant distributed system; faulty component determination; fully-connected networks; hierarchical algorithms; hierarchical testing strategy; hypercube; latency; node state; overhead; timestamp; Adaptive systems; Counting circuits; Delay; Distributed algorithms; Event detection; Fault diagnosis; Fault tolerant systems; Informatics; Local area networks; System testing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Systems, 2000. Proceedings. Seventh International Conference on

Conference_Location :

Iwate

ISSN :

1521-9097

Print_ISBN :

0-7695-0568-6

Type :

conf

DOI :

10.1109/ICPADS.2000.857711

Filename :

857711

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2257550