Regional Information Center for Science and Technology (RICEST)

DocumentCode :

1840592

Title :

The φ accrual failure detector

Author :

Hayashibara, Naohiro ; Défago, Xavier ; Yared, Rami ; Katayama, Takuya

Author_Institution :

Sch. of Inf. Sci., Japan Adv. Inst. of Sci. & Technol., Ishikawa, Japan

fYear :

2004

fDate :

18-20 Oct. 2004

Firstpage :

Lastpage :

Abstract :

The detection of failures is a fundamental issue for fault tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far, one of the reasons being that classical failure detectors were not designed to satisfy several application requirements simultaneously. We present a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems. Instead of providing information of a binary nature (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale. The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment. In this paper, we describe an implementation of such an accrual failure detector, which we call the φ failure detector. The particularity of the φ failure detector is that it dynamically adjusts the scale on which the suspicion level is expressed to current network conditions. We analyzed the behavior of our φ failure detector over an intercontinental communication link for a week. Our experimental results show that it performs as well as other known adaptive failure detection mechanisms, with improved flexibility.
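
The idea of a continuous suspicion level can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give a formula, so the sketch assumes heartbeat inter-arrival times are modeled with a normal distribution estimated over a sliding window, and defines the suspicion level as φ = -log10(probability that the next heartbeat arrives later than the time already elapsed). The names PhiAccrualDetector, heartbeat, phi, window_size and min_std are hypothetical.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Illustrative accrual detector: reports a suspicion level, not a verdict."""

    def __init__(self, window_size=1000, min_std=0.1):
        # Sliding window of recent heartbeat inter-arrival times (assumption).
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so perfectly regular heartbeats
        # do not cause a division by zero (assumption).
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; grows as the silence gets longer."""
        if not self.intervals:
            return 0.0  # not enough history to say anything yet
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Normal survival function: P(inter-arrival time > elapsed).
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # keep log10 finite
        return -math.log10(p_later)


# Monitoring runs once; each application applies its own threshold.
detector = PhiAccrualDetector()
for t in [0, 1, 2, 3, 4]:
    detector.heartbeat(t)

level = detector.phi(6.0)
cautious_app_suspects = level > 8.0    # tolerates few wrong suspicions
aggressive_app_suspects = level > 1.0  # prefers fast detection
```

The two thresholds at the end show the decoupling the abstract describes: the detector only monitors and reports a level, while each application decides for itself what level counts as a failure.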

Keywords :

data communication; distributed processing; fault diagnosis; fault tolerant computing; system recovery; φ accrual failure detector; adaptive mechanism; application requirements; continuous information provision; distributed systems; failure detection; fault tolerance; intercontinental communication link; monitoring; suspicion level; Computer crashes; Condition monitoring; Detectors; Event detection; Failure analysis; Fault detection; Fault tolerant systems; H infinity control; Information science; Quality of service;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on

ISSN :

1060-9857

Print_ISBN :

0-7695-2239-4

Type :

conf

DOI :

10.1109/RELDIS.2004.1353004

Filename :

1353004

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1840592