• DocumentCode
    2043109
  • Title

    A proactive fault-detection mechanism in large-scale cluster systems

  • Author

    Linping, Wu ; Dan, Meng ; Wen, Gao ; Jianfeng, Zhan

  • Author_Institution
    Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
  • fYear
    2006
  • fDate
    25-29 April 2006
  • Abstract
    To improve the overall dependability of large-scale cluster systems, this paper proposes an online fault-detection mechanism. The mechanism detects faults before a node fails, which enables proactive fault management. It works in two steps. First, the dynamic characteristics of the cluster system under normal activity are modeled using time series analysis methods. Second, faults are detected by comparing the current running state of the cluster system with the normal running model. A fault alarm is raised immediately when the current running state deviates from the normal running model. Experimental results show that the mechanism detects faults in the cluster system in good time.
  • Keywords
    fault tolerant computing; telecommunication network management; time series; workstation clusters; fault alarm decision; large-scale cluster systems; online fault detection mechanism; proactive fault management; proactive fault-detection mechanism; time series analysis; Aging; Computers; Fault detection; Hard disks; Large-scale systems; Monitoring; Operating systems; Power system management; Testing; Time series analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
  • Conference_Location
    Rhodes Island
  • Print_ISBN
    1-4244-0054-6
  • Type

    conf

  • DOI
    10.1109/IPDPS.2006.1639332
  • Filename
    1639332
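
The two steps described in the Abstract (model normal behavior, then raise an alarm when the current state deviates from it) can be illustrated with a minimal sketch. The abstract does not say which time series method, metric, or threshold the authors use, so everything here is an assumption for illustration: a first-order autoregressive model fitted by least squares, a hypothetical memory-usage metric, and a threshold of `k` residual standard deviations. The names `fit_ar1` and `detect` are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def fit_ar1(series):
    """Fit x[t] = c + phi * x[t-1] by least squares on fault-free samples.

    Assumption: AR(1) stands in for the unspecified time series method.
    Returns (c, phi, sigma), where sigma is the residual standard deviation.
    """
    x_prev, x_next = series[:-1], series[1:]
    m_prev, m_next = mean(x_prev), mean(x_next)
    var = sum((a - m_prev) ** 2 for a in x_prev)
    cov = sum((a - m_prev) * (b - m_next) for a, b in zip(x_prev, x_next))
    phi = cov / var if var else 0.0
    c = m_next - phi * m_prev
    residuals = [b - (c + phi * a) for a, b in zip(x_prev, x_next)]
    return c, phi, pstdev(residuals)


def detect(c, phi, sigma, stream, last_normal, k=3.0):
    """Return indices in `stream` whose one-step prediction error exceeds k * sigma.

    Assumption: a fixed k-sigma rule stands in for the paper's alarm decision.
    """
    alarms = []
    prev = last_normal
    for i, value in enumerate(stream):
        predicted = c + phi * prev
        if abs(value - predicted) > k * sigma:
            alarms.append(i)
        prev = value
    return alarms


# Hypothetical metric: memory usage (%) sampled on one node during normal activity.
normal = [50 + ((t * 7) % 5) - 2 for t in range(200)]
c, phi, sigma = fit_ar1(normal)

# Live samples; the value 90 is an injected anomaly at index 3.
alarms = detect(c, phi, sigma, [50, 51, 49, 90, 50], last_normal=normal[-1])
print(alarms)
```

In this sketch the sample that follows an anomaly may be flagged as well, because its prediction is computed from the anomalous value. A real deployment would need a policy for that case, which the abstract does not describe.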