Title :
Research on Online Failure Prediction Model and Status Pretreatment Method for Exascale System
Author :
Zhou, Hao ; Jiang, Yanhuang
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
Reliability is a critical concern for Exascale systems. Traditional passive fault-tolerance methods, such as rollback-recovery, can no longer fully guarantee system reliability because of their large execution overhead and long recovery time. Active fault tolerance is expected to become another important fault-tolerance approach for Exascale systems. Focusing on system failure prediction, a key step of active fault tolerance, we construct an online failure prediction model and study an effective method for system status pretreatment. To improve the accuracy and timeliness of current methods, the proposed Improved Adaptive Semantic Filter (IASF) method periodically processes the latest system logs and filters out useless information according to the semantics of the records. Adopting the main idea of the Vector Space Model (VSM), IASF creates an Event Vector for each log record. It evaluates the semantic correlation between log records by computing the cosine of the angle between their Event Vectors, and then applies temporal and spatial redundancy filtering that takes the bursty nature of log records into account. IASF is insensitive to the type of system log and does not rely on any expert system or domain knowledge. Experimental results show that the system eliminates about 99.6% of useless log records after applying IASF.
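The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' IASF implementation: the tokenization, the fixed similarity threshold, the two time windows, and the names `event_vector`, `cosine`, and `filter_redundant` are all assumptions made for illustration, and the adaptive and burst-handling parts of IASF are not reproduced because the abstract does not specify them.

```python
import math
import re
from collections import Counter


def event_vector(message):
    """Build a term-frequency vector for one log message.

    Assumption: tokens are lowercase alphabetic words; digits are dropped
    so that messages differing only in IDs map to the same vector.
    """
    return Counter(re.findall(r"[a-z]+", message.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_redundant(records, temporal_window=60.0, spatial_window=30.0,
                     threshold=0.8):
    """Drop records that are semantically similar to a recently kept one.

    records: list of (timestamp_seconds, node, message), sorted by time.
    A similar record from the same node within `temporal_window` counts as
    temporally redundant; one from a different node within `spatial_window`
    counts as spatially redundant. All three parameters are hypothetical.
    """
    kept = []  # entries are (timestamp, node, message, vector)
    for ts, node, msg in records:
        vec = event_vector(msg)
        redundant = False
        for k_ts, k_node, _k_msg, k_vec in kept:
            window = temporal_window if k_node == node else spatial_window
            if ts - k_ts <= window and cosine(vec, k_vec) >= threshold:
                redundant = True
                break
        if not redundant:
            kept.append((ts, node, msg, vec))
    return [(ts, node, msg) for ts, node, msg, _vec in kept]
```

In this sketch every kept record is compared against each new one, which is quadratic in the worst case; a practical filter would presumably discard kept records once they fall outside both windows.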
Keywords :
adaptive filters; fault tolerant computing; system monitoring; Exascale system; active fault tolerance; event vector; improved adaptive semantic filter method; online failure prediction model; passive fault-tolerant methods; reliability issue; rollback-recovery; spatial redundant filter; status pretreatment method; system failure prediction; system logs; system status pretreatment; temporal redundant filter; vector space model; vectorial angle cosine; Correlation; Fault tolerance; Fault tolerant systems; Information filters; Predictive models; Vectors; Exascale system; active fault tolerance; data mining; failure prediction; log process;
Conference_Title :
Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1827-4
DOI :
10.1109/CyberC.2011.68