DocumentCode :
8516
Title :
Characterizing the Impact of Intermittent Hardware Faults on Programs
Author :
Rashid, Layali ; Pattabiraman, Karthik ; Gopalakrishnan, Sathish
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of British Columbia, Vancouver, BC, Canada
Volume :
64
Issue :
1
fYear :
2015
fDate :
March 2015
Firstpage :
297
Lastpage :
310
Abstract :
Extreme complementary metal-oxide-semiconductor (CMOS) technology scaling is causing significant concerns about the reliability of computer systems. Intermittent hardware errors are non-deterministic bursts of errors that occur in the same physical location. Recent studies have found that 40% of the processor failures in real-world machines are due to intermittent hardware errors. A study of the effects of intermittent faults on programs is a critical step in building fault-tolerance techniques of reasonable accuracy and cost. In this work, we characterize the impact of intermittent hardware faults on programs using fault-injection campaigns in a microarchitectural processor simulator. We find that 80% of the non-benign intermittent hardware errors activate a hardware trap in the processor, and the remaining 20% cause silent data corruptions. We have also investigated the possibility of using the program state at failure time in software-based diagnosis techniques, and found that much of the erroneous data are intact and can be used to identify the source of the error.
Keywords :
CMOS integrated circuits; failure analysis; fault tolerance; integrated circuit reliability; microprocessor chips; CMOS technology scaling; computer system reliability; extreme complementary metal oxide semiconductor; fault injection campaigns; fault tolerance; hardware trap; intermittent hardware faults; microarchitectural processor simulator; nonbenign intermittent hardware errors; nondeterministic bursts; processor failures; real-world machines; silent data corruptions; software-based diagnosis; Benchmark testing; Circuit faults; Computer crashes; Fault tolerance; Hardware; Microarchitecture; Transient analysis; Fault diagnosis; fault injection; fault model; fault propagation; intermittent hardware faults
fLanguage :
English
Journal_Title :
Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9529
Type :
jour
DOI :
10.1109/TR.2014.2363152
Filename :
6933951