DocumentCode :
2166264
Title :
Group communication protocols under errors
Author :
Basile, Claudio ; Wang, Long ; Kalbarczyk, Zbigniew ; Iyer, Ravi
Author_Institution :
Center for Reliable & High-Performance Comput., Univ. of Illinois, Urbana-Champaign, IL, USA
fYear :
2003
fDate :
6-8 Oct. 2003
Firstpage :
35
Lastpage :
44
Abstract :
Group communication protocols constitute a basic building block for highly dependable distributed applications. Designing and correctly implementing a group communication system (GCS) is a difficult task. While many theoretical algorithms have been formalized and proved correct, only a few research projects have experimentally assessed the dependability of GCS implementations under complex error scenarios. This paper describes a thorough error-injection experimental campaign conducted on Ensemble, a popular GCS. By employing synthetic benchmark applications, we stress selected components of the GCS (the group membership service and the FIFO-ordered reliable multicast) under various error models, including errors in the memory (text and heap segments) and in the network messages. The data show that about 5-6% of the failures are due to an error escaping Ensemble's error-containment mechanism and manifesting as a fail-silence violation. This constitutes an impediment to achieving high dependability, the natural objective of GCSs. Our results are derived for a particular system (Ensemble), and more investigation involving other GCSs is required to generalize the conclusions. Nevertheless, through an accurate analysis of the failure causes and the error propagation patterns, this paper offers insights into the design and the implementation of robust GCSs.
Keywords :
error detection; fault tolerant computing; formal verification; multicast communication; protocols; Ensemble; GCS; error models; error propagation patterns; error-injection experiment; fail silence violation; group communication protocols; group communication system; synthetic benchmark applications; Communication systems; Computer crashes; Distributed computing; Error correction; Failure analysis; Formal verification; Impedance; Pattern analysis; Protocols; Stress;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Reliable Distributed Systems, 2003. Proceedings. 22nd International Symposium on
ISSN :
1060-9857
Print_ISBN :
0-7695-1955-5
Type :
conf
DOI :
10.1109/RELDIS.2003.1238053
Filename :
1238053