• DocumentCode
    2888335
  • Title
    Anomaly localization in large-scale clusters
  • Author
    Zheng, Ziming ; Li, Yawei ; Lan, Zhiling
  • Author_Institution
    Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
  • fYear
    2007
  • fDate
    17-20 Sept. 2007
  • Firstpage
    322
  • Lastpage
    330
  • Abstract
    A critical problem in managing large-scale clusters is identifying the location of problems in the system when unusual events occur. As the scale of high performance computing (HPC) grows, systems are getting bigger. When a system fails to function properly, health-related data are collected for troubleshooting. However, due to the massive quantities of information obtained from a large number of components, the root causes of anomalies are often buried like needles in a haystack. In this paper, we present a localization method that automatically finds the potential root causes (i.e., a subset of nodes) of a problem from the overwhelming amount of data collected system-wide. System managers can focus on examining these potential locations, thereby significantly reducing the human effort required for anomaly localization. Our method consists of three interrelated steps: (1) feature collection to assemble a feature space for the system; (2) feature extraction to obtain the most significant features for efficient data analysis by applying the principal component analysis (PCA) algorithm; and (3) outlier detection to quickly identify the nodes that are "far away" from the majority by using the cell-based detection algorithm. Preliminary studies are presented to demonstrate the potential of our method for localizing anomalies in a computing environment where the nodes perform comparable tasks.
  • Keywords
    feature extraction; principal component analysis; workstation clusters; anomaly localization; cell-based detection algorithm; data analysis; feature collection; feature extraction; high performance computing; large-scale clusters; outlier detection; principal component analysis algorithm; root causes; Assembly systems; Data analysis; Detection algorithms; Feature extraction; High performance computing; Humans; Large-scale systems; Needles; Predictive models; Principal component analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing, 2007 IEEE International Conference on
  • Conference_Location
    Austin, TX
  • ISSN
    1552-5244
  • Print_ISBN
    978-1-4244-1387-4
  • Electronic_ISBN
    1552-5244
  • Type
    conf
  • DOI
    10.1109/CLUSTR.2007.4629246
  • Filename
    4629246
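
The three steps named in the abstract (feature collection, PCA-based feature extraction, outlier detection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `radius` and `max_neighbors` parameters, and the choice of two principal components are assumptions made for illustration. In particular, a brute-force pairwise-distance check stands in for the cell-based detection algorithm, which the abstract names but does not detail.

```python
import numpy as np


def pca_reduce(features, n_components=2):
    """Project a (nodes x features) matrix onto its top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def distance_outliers(points, radius, max_neighbors):
    """Return indices of points with at most `max_neighbors` other points
    within `radius`. Brute-force O(n^2) stand-in for a cell-based method
    (an assumption, not the algorithm from the paper)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Subtract 1 so that a point does not count itself as a neighbor.
    neighbor_counts = (dists <= radius).sum(axis=1) - 1
    return np.flatnonzero(neighbor_counts <= max_neighbors)


def localize(features, n_components=2, radius=1.0, max_neighbors=0):
    """Hypothetical pipeline: reduce the feature space, then flag the
    nodes that lie far from the majority."""
    reduced = pca_reduce(features, n_components)
    return distance_outliers(reduced, radius, max_neighbors)
```

For example, given a feature matrix with one row per node, `localize(features)` returns the indices of nodes with no neighbor within distance 1.0 in the reduced space. This relies on the setting the abstract describes, in which nodes perform comparable tasks and therefore healthy nodes cluster together.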