• DocumentCode
    2587646
  • Title

    Assessing the crash-failure assumption of group communication protocols

  • Author

    Mena, Sergio ; Basile, Claudio ; Kalbarczyk, Zbigniew ; Schiper, André ; Iyer, Ravi K.

  • Author_Institution
    Ecole Polytechnique Fed. de Lausanne
  • fYear
    2005
  • fDate
    1-1 Nov. 2005
  • Lastpage
    116
  • Abstract
    Designing and correctly implementing group communication systems (GCSs) is notoriously difficult. Assuming that processes fail only by crashing provides a powerful means to simplify the theoretical development of these systems. When making this assumption, however, one should not forget that clean crash failures provide only a coarse approximation of the effects that errors can have in distributed systems. Ignoring such a discrepancy can lead to complex GCS-based applications that pay a large price in terms of performance overhead yet fail to deliver the promised level of dependability. This paper provides a thorough study of error effects in real systems by demonstrating an error-injection-driven design methodology, where error injection is integrated in the core steps of the design process of a robust fault-tolerant system. The methodology is demonstrated for the Fortika toolkit, a Java-based GCS. Error injection enables us to uncover subtle reliability bottlenecks both in the design of Fortika and in the implementation of Java. Based on the obtained insights, we enhance Fortika's design to reduce the identified bottlenecks. Finally, a comparison of the results obtained for Fortika with the results obtained for the OCAML-based Ensemble system in previous work allows us to investigate the reliability implications that the choice of the development platform (Java versus OCAML) can have.
  • Keywords
    Java; distributed processing; failure analysis; groupware; program testing; software fault tolerance; Fortika toolkit; Java; OCAML-based Ensemble system; crash-failure assumption; dependability; distributed systems; error-injection-driven design; fault tolerant system; group communication protocols; group communication systems; reliability bottlenecks; system error effects; Broadcasting; Communication systems; Computer crashes; Design methodology; Fault tolerant systems; Java; Power system reliability; Process design; Protocols; Robustness;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Reliability Engineering, 2005. ISSRE 2005. 16th IEEE International Symposium on
  • Conference_Location
    Chicago, IL
  • ISSN
    1071-9458
  • Print_ISBN
    0-7695-2482-6
  • Type

    conf

  • DOI
    10.1109/ISSRE.2005.9
  • Filename
    1544726