• DocumentCode
    8516
  • Title

    Characterizing the Impact of Intermittent Hardware Faults on Programs

  • Author

    Rashid, Layali ; Pattabiraman, Karthik ; Gopalakrishnan, Sathish

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of British Columbia, Vancouver, BC, Canada
  • Volume
    64
  • Issue
    1
  • fYear
    2015
  • fDate
    Mar-15
  • Firstpage
    297
  • Lastpage
    310
  • Abstract
    Extreme complementary metal-oxide-semiconductor (CMOS) technology scaling is causing significant concerns about the reliability of computer systems. Intermittent hardware errors are non-deterministic bursts of errors that occur in the same physical location. Recent studies have found that 40% of the processor failures in real-world machines are due to intermittent hardware errors. A study of the effects of intermittent faults on programs is a critical step in building fault-tolerance techniques of reasonable accuracy and cost. In this work, we characterize the impact of intermittent hardware faults on programs using fault-injection campaigns in a microarchitectural processor simulator. We find that 80% of the non-benign intermittent hardware errors activate a hardware trap in the processor, and the remaining 20% cause silent data corruptions. We have also investigated the possibility of using the program state at failure time in software-based diagnosis techniques, and found that much of the erroneous data is intact and can be used to identify the source of the error.
  • Keywords
    CMOS integrated circuits; failure analysis; fault tolerance; integrated circuit reliability; microprocessor chips; CMOS technology scaling; computer system reliability; extreme complementary metal oxide semiconductor; fault injection campaigns; fault tolerance; hardware trap; intermittent hardware faults; microarchitectural processor simulator; nonbenign intermittent hardware errors; nondeterministic bursts; processor failures; real-world machines; silent data corruptions; software-based diagnosis; Benchmark testing; Circuit faults; Computer crashes; Fault tolerance; Hardware; Microarchitecture; Transient analysis; Fault diagnosis; fault injection; fault model; fault propagation; intermittent hardware faults;
  • fLanguage
    English
  • Journal_Title
    Reliability, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9529
  • Type

    jour

  • DOI
    10.1109/TR.2014.2363152
  • Filename
    6933951