DocumentCode :
1053808
Title :
On the implementation of unreliable failure detectors in partially synchronous systems
Author :
Larrea, Mikel ; Fernández, Antonio ; Arévalo, Sergio
Author_Institution :
Dept. de Arquitectura y Tecnologia de Comput., Pais Vasco Univ., San Sebastian, Spain
Volume :
53
Issue :
7
fYear :
2004
fDate :
7/1/2004 12:00:00 AM
Firstpage :
815
Lastpage :
828
Abstract :
Unreliable failure detectors were proposed by Chandra and Toueg as mechanisms that provide information about process failures. Chandra and Toueg defined eight classes of failure detectors, depending on how accurate this information is, and presented an algorithm implementing a failure detector of one of these classes in a partially synchronous system. This algorithm is based on all-to-all communication and periodically exchanges a number of messages that is quadratic on the number of processes. We study the implementability of different classes of failure detectors in several models of partial synchrony. We first show that no failure detector with perpetual accuracy (namely, P, Q, S, and W) can be implemented in these models in systems with even a single failure. We also show that, in these models of partial synchrony, it is necessary a majority of correct processes to implement a failure detector of the class θ proposed by Aguilera et al. Then, we present a family of distributed algorithms that implement the four classes of unreliable failure detectors with eventual accuracy (namely, P, Q, S, and W). Our algorithms are based on a logical ring arrangement of the processes, which defines the monitoring and failure information propagation pattern. The resulting algorithms periodically exchange at most a linear number of messages.
Keywords :
computational complexity; distributed algorithms; system recovery; all-to-all communication; consensus problem; crash failures; distributed algorithms; distributed systems; failure information propagation pattern; logical ring arrangement; partially synchronous system; process failures; unreliable failure detector; Algorithm design and analysis; Computational modeling; Computer crashes; Condition monitoring; Detectors; Distributed algorithms; Distributed computing; Helium; Synchronous generators; Timing; 65; Consensus problem; crash failures; distributed systems; failure detection; partial synchrony; unreliable failure detectors.;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
ieee
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2004.33
Filename :
1321043
Link To Document :
بازگشت