  • DocumentCode
    3516260
  • Title
    Automatic fault detection and diagnosis in complex software systems by information-theoretic monitoring
  • Author
    Jiang, Miao ; Munawar, Mohammad A. ; Reidemeister, Thomas ; Ward, Paul A. S.
  • Author_Institution
    E&CE Dept., Univ. of Waterloo, Waterloo, ON, Canada
  • fYear
    2009
  • fDate
    June 29 2009-July 2 2009
  • Firstpage
    285
  • Lastpage
    294
  • Abstract
    Management metrics of complex software systems exhibit stable correlations which can enable fault detection and diagnosis. Current approaches use specific analytic forms, typically linear, for modeling correlations. In this paper we use normalized mutual information as a similarity measure to identify clusters of correlated metrics, without knowing the specific form. We show how we can apply the Wilcoxon rank-sum test to identify anomalous behaviour. We present two diagnosis algorithms to locate faulty components: RatioScore, based on the Jaccard coefficient, and SigScore, which incorporates knowledge of component dependencies. We evaluate our mechanisms in the context of a complex enterprise application. Through fault injection experiments, we show that we can detect 17 out of 22 faults without any false positives. We diagnose the faulty component in the top five anomaly scores 7 times out of 17 using SigScore, which is 40% better than when system structure is ignored.
  • Keywords
    fault diagnosis; fault tolerant computing; information theory; software maintenance; statistical analysis; Jaccard coefficient; RatioScore component; SigScore component; Wilcoxon rank-sum test; anomalous behaviour identification; automatic fault detection system; complex enterprise application; complex software system; fault diagnosis; information theoretic monitoring; management metrics; normalized mutual information; Automatic testing; Clustering algorithms; Computerized monitoring; Entropy; Fault detection; Fault diagnosis; Fault location; Information theory; Predictive models; Software systems; fault detection and diagnosis; information theory; self-managing systems; statistical techniques;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems & Networks, 2009. DSN '09. IEEE/IFIP International Conference on
  • Conference_Location
    Lisbon
  • Print_ISBN
    978-1-4244-4422-9
  • Electronic_ISBN
    978-1-4244-4421-2
  • Type
    conf
  • DOI
    10.1109/DSN.2009.5270324
  • Filename
    5270324