Title :
System availability monitoring
Author :
Moran, Pat ; Gaffney, Pat ; Melody, John ; Condon, Maria ; Hayden, Margaret
Author_Institution :
Digital Equipment Int. BV, Galway, Ireland
fDate :
10/1/1990 12:00:00 AM
Abstract :
A process set up by Digital to monitor and quantify the availability of its systems is described. The reliability data are collected in an automated manner and stored in a database. The breadth of data gathered provides a unique opportunity to correlate hardware and software failures. In addition, several hypotheses have been tested, e.g., the relationship between crash rate and system load, the interdependence of crashes, the cause of crashes, and the effect of new releases in the operating system. It is concluded that the process (in operation since 1988) has yielded worthwhile information on the products monitored. The usual availability metrics are calculated regularly for the machines monitored. Trends in system fault occurrence have been identified, leading to suggestions for both software and hardware improvements. The monitoring process and analysis methodology are revised on an ongoing basis to improve the quality of information obtained and to extend the analysis to Digital's new systems. The recently announced VAX9000 mainframe and fault-tolerant VAXft 3000 are two such systems.
Keywords :
computer installation; fault tolerant computing; reliability; Digital; VAX9000 mainframe; availability metrics; crash rate; fault-tolerant VAXft 3000; operating system; reliability data; software failures; system fault occurrence; system load; Availability; Computer crashes; Condition monitoring; Databases; Fault diagnosis; Hardware; Information analysis; Operating systems; System testing; Vehicle crash testing
Journal_Title :
Reliability, IEEE Transactions on