• DocumentCode
    330834
  • Title
    Reliability analysis of clustered computing systems
  • Author
    Mendiratta, Veena B.
  • Author_Institution
    AT&T Bell Labs., Naperville, IL, USA
  • fYear
    1998
  • fDate
    4-7 Nov 1998
  • Firstpage
    268
  • Lastpage
    272
  • Abstract
    Clustered computing systems, using commercially available computers networked in a loosely coupled fashion, can provide high levels of reliability if appropriate levels of error detection and recovery software are implemented in the middleware and application layers. In this paper, we present a modeling approach for analyzing the hardware and software reliability of clustered computing systems. The clustered system is modeled as an irreducible Markov chain with working and failed states, and intermediate recovery states. The failure and recovery behavior is characterized in terms of the frequency and duration of fault recoveries and outages for a single processor in the cluster and for the entire clustered system. We apply the model to a telecommunication switching system application that uses the Lucent Technologies Reliable Clustered Computing product. The model results are presented for a range of values of the processor failure rate and the fault recovery coverage factor.
  • Keywords
    Markov processes; client-server systems; computer network reliability; electronic switching systems; error detection; software reliability; switching networks; system recovery; telecommunication computing; workstation clusters; Lucent Technologies Reliable Clustered Computing product; application layers; clustered computing systems reliability; commercially available computers; error detection; error recovery software; failed states; failure behavior; fault recovery behavior; fault recovery coverage factor; hardware reliability; intermediate recovery states; irreducible Markov chain; loosely-coupled computer network; middleware; modeling; networked computers; outages; processor failure rate; software reliability; telecommunication switching system; working states; Application software; Computer errors; Computer network reliability; Computer networks; Frequency; Hardware; Middleware; Software reliability; Telecommunication computing; Telecommunication switching
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Reliability Engineering, 1998. Proceedings. The Ninth International Symposium on
  • Conference_Location
    Paderborn
  • ISSN
    1071-9458
  • Print_ISBN
    0-8186-8991-9
  • Type
    conf
  • DOI
    10.1109/ISSRE.1998.730890
  • Filename
    730890