DocumentCode :
2502209
Title :
Understanding large system failures - a fault injection experiment
Author :
Chillarege, R. ; Bowen, N.S.
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
fYear :
1989
fDate :
21-23 June 1989
Firstpage :
356
Lastpage :
363
Abstract :
Fault injection is used to characterize large system failures, overcoming the limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate that total loss of the primary service occurs in only 16% of the faults; (3) reveal errors, termed potential hazards, that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
Keywords :
fault tolerant computing; software reliability; catastrophic failure; commercial transaction processing system; failure acceleration; fault injection experiment; field failure data; large system failures; modeling of availability; potential hazards; Acceleration; Automatic control; Automatic testing; Automation; Cause effect analysis; Control systems; Failure analysis; Hazards; Laboratories; Software systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers., Nineteenth International Symposium on
Conference_Location :
Chicago, IL, USA
Print_ISBN :
0-8186-1959-7
Type :
conf
DOI :
10.1109/FTCS.1989.105592
Filename :
105592