• DocumentCode
    2903693
  • Title
    AFD: Adaptive failure detection system for cloud computing infrastructures
  • Author
    Pannu, H.S. ; Jianguo Liu ; Qiang Guan ; Song Fu
  • Author_Institution
    Dept. of Math., Univ. of North Texas, Denton, TX, USA
  • fYear
    2012
  • fDate
    1-3 Dec. 2012
  • Firstpage
    71
  • Lastpage
    80
  • Abstract
    Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructure. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software failures. Autonomic failure detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect failures, we need to monitor the cloud execution and collect runtime performance data. These data are usually unlabeled, and a prior failure history is not always available in production clouds, especially for newly managed or deployed systems. In this paper, we present an Adaptive Failure Detection (AFD) framework for cloud dependability assurance. AFD employs hypersphere-based data description for adaptive failure detection. Based on the cloud performance data, AFD detects possible failures, which are then verified by the cloud operators: each is confirmed as either a true failure, with its failure type, or a normal state. AFD adapts itself by recursively learning from these newly verified detection results to refine future detections. Meanwhile, AFD exploits the observed but undetected failure records reported by the cloud operators to identify new types of failures. We have implemented a prototype of the AFD system and conducted experiments in an on-campus cloud computing environment. Our experimental results show that AFD can achieve more efficient and accurate failure detection than other existing schemes.
  • Keywords
    cloud computing; fault diagnosis; fault tolerant computing; learning (artificial intelligence); AFD framework; adaptive failure detection system; autonomic failure detection; cloud execution monitoring; cloud operators; data description; hardware failures; hypersphere; normal states; on-campus cloud computing environment; production cloud computing system infrastructures; recursive learning; runtime cloud performance data collection; self-managing cloud resources; software failures; system-level cloud dependability assurance; true failure types; unlabeled data; Cloud computing; Detectors; Equations; Kernel; Measurement; Servers; Virtual machining; Autonomic management; Cloud computing; Dependable systems; Failure detection; Learning algorithms;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Performance Computing and Communications Conference (IPCCC), 2012 IEEE 31st International
  • Conference_Location
    Austin, TX
  • ISSN
    1097-2641
  • Print_ISBN
    978-1-4673-4881-2
  • Type
    conf
  • DOI
    10.1109/PCCC.2012.6407740
  • Filename
    6407740
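
The detection loop summarized in the abstract (enclose normal performance data in a hypersphere, flag points outside it, then refine the boundary from operator-verified results) can be illustrated with a minimal sketch. This is not the paper's algorithm: the class `HypersphereDetector`, its `quantile` parameter, and the mean-center and distance-quantile radius are all assumptions chosen for brevity, and the record above does not specify how AFD actually fits or updates its hypersphere.

```python
import math
from statistics import fmean


class HypersphereDetector:
    """Toy hypersphere data description (hypothetical, not the AFD implementation)."""

    def __init__(self, quantile=0.95):
        self.quantile = quantile      # assumed: fraction of normal samples to enclose
        self.normal = []              # samples currently believed to be normal
        self.confirmed_failures = []  # (sample, failure_type) pairs verified by operators
        self.center = None
        self.radius = None

    def fit(self, samples):
        """Fit the sphere to unlabeled data that is assumed to be mostly normal."""
        self.normal = [tuple(s) for s in samples]
        self._refit()
        return self

    def _refit(self):
        # Center = per-dimension mean; radius = distance quantile of the normal samples.
        dims = range(len(self.normal[0]))
        self.center = tuple(fmean(s[d] for s in self.normal) for d in dims)
        dists = sorted(math.dist(s, self.center) for s in self.normal)
        idx = min(len(dists) - 1, int(self.quantile * len(dists)))
        self.radius = dists[idx]

    def is_failure(self, sample):
        """Flag a sample as a possible failure if it lies outside the sphere."""
        return math.dist(sample, self.center) > self.radius

    def feedback(self, sample, failure_type=None):
        """Apply an operator verdict: failure_type=None means 'normal state'."""
        if failure_type is None:
            # False alarm: fold the sample into the normal set and refit the sphere.
            self.normal.append(tuple(sample))
            self._refit()
        else:
            # True failure: record it with its type; the sphere is left unchanged.
            self.confirmed_failures.append((tuple(sample), failure_type))


detector = HypersphereDetector(quantile=1.0).fit([(0, 0), (1, 0), (0, 1), (1, 1)])
suspect = (1.5, 0.5)
flagged_before = detector.is_failure(suspect)   # outside the initial sphere
detector.feedback(suspect)                      # operator says: normal state
flagged_after = detector.is_failure(suspect)    # sphere has grown to include it
```

In this sketch a false alarm enlarges the sphere, so the same point is not flagged again, while confirmed failures are only recorded with their type. A fuller treatment would also use those labeled failures to tighten the boundary, which is omitted here.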