• DocumentCode
    36283
  • Title

    Adaptive Algorithms for Diagnosing Large-Scale Failures in Computer Networks

  • Author

    Tati, Srikar ; Bong Jun Ko ; Guohong Cao ; Swami, Ananthram ; La Porta, Thomas F.

  • Author_Institution
    Inst. for Networking & Security Res., Pennsylvania State Univ., University Park, PA, USA
  • Volume
    26
  • Issue
    3
  • fYear
    2015
  • fDate
    Mar-15
  • Firstpage
    646
  • Lastpage
    656
  • Abstract
    We propose a greedy algorithm, Cluster-MAX-COVERAGE (CMC), to efficiently diagnose large-scale clustered failures. We primarily address the challenge of determining faults with incomplete symptoms. CMC makes novel use of both positive and negative symptoms to output a hypothesis list with a low number of false negatives and false positives quickly. CMC requires reports from about half as many nodes as other existing algorithms to determine failures with 100 percent accuracy. Moreover, CMC accomplishes this gain significantly faster (sometimes by two orders of magnitude) than an algorithm that matches its accuracy. When there are fewer positive and negative symptoms at a reporting node, CMC performs much better than existing algorithms. We also propose an adaptive algorithm called Adaptive-MAX-COVERAGE (AMC) that performs efficiently during both independent and clustered failures. During a series of failures that include both independent and clustered, AMC results in a reduced number of false negatives and false positives.
  • Keywords
    computer network reliability; fault diagnosis; greedy algorithms; large-scale systems; workstation clusters; AMC; CMC; adaptive algorithms; adaptive-MAX-COVERAGE; cluster-MAX-COVERAGE; false negatives; false positives; greedy algorithm; large-scale clustered failure diagnosis; Accuracy; Adaptive algorithms; Clustering algorithms; Complexity theory; Computer networks; Fault diagnosis; Network topology; Fault diagnosis; clustered failures; incomplete information; large-scale failures;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2014.2311814
  • Filename
    6767126