DocumentCode :
3454833
Title :
Proactive detection of software aging mechanisms in performance critical computers
Author :
Gross, Kenny C. ; Bhardwaj, Vatsal ; Bickford, Randy
Author_Institution :
Sun Microsystems, USA
fYear :
2002
fDate :
5-6 Dec. 2002
Firstpage :
17
Lastpage :
23
Abstract :
Software aging is a phenomenon, usually caused by resource contention, that can cause mission critical and business critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded development of an advanced statistical pattern recognition method called the multivariate state estimation technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present investigation was undertaken to assess the feasibility and practicability of applying MSET for realtime proactive detection of software aging mechanisms in complex, multiCPU servers. The procedure uses MSET for model based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A realtime software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET was able to reliably detect the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business critical eCommerce servers are described.
Keywords :
electronic commerce; fault tolerant computing; parameter estimation; software performance evaluation; state estimation; statistical analysis; system recovery; Bayesian fault decision processing; NASA-funded innovation; Space Shuttle Main Engine telemetry data; business critical computer system; business critical eCommerce servers; computer system load; controllable parasitic resource consumption rates; dynamic failover system events; dynamic sensor; false-alarm avoidance; fault injection experiments; fault tolerant systems; memory leak injector tool; mission critical computer system; model based parameter estimation; multiCPU servers; multivariate state estimation technique; nuclear power plants; performance critical computers; performance degradation; proactive detection; queue length; realtime software telemetry harness; resource contention; signal anomalies; software aging mechanism; spin-off applications; statistical fault detection; statistical pattern recognition method; throughput; transaction latencies; Aging; Degradation; Event detection; Fault detection; Fault tolerant systems; Mission critical systems; NASA; Pattern recognition; Software performance; Telemetry;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Engineering Workshop, 2002. Proceedings. 27th Annual NASA Goddard/IEEE
Print_ISBN :
0-7695-1855-9
Type :
conf
DOI :
10.1109/SEW.2002.1199445
Filename :
1199445