• DocumentCode
    1655865
  • Title

    CrashTest'ing SWAT: Accurate, gate-level evaluation of symptom-based resiliency solutions

  • Author

    Pellegrini, A. ; Smolinski, R. ; Chen, L. ; Fu, X. ; Hari, S.K.S. ; Jiang, J. ; Adve, S.V. ; Austin, T. ; Bertacco, V.

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Univ. of Michigan, Ann Arbor, MI, USA
  • fYear
    2012
  • Firstpage
    1106
  • Lastpage
    1109
  • Abstract
    Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.
  • Keywords
    circuit reliability; fault diagnosis; field programmable gate arrays; logic gates; microprocessor chips; microsensors; network synthesis; FPGA-based full-system prototype; OpenSPARC T1 core logic; SDC rate; SPECInt 2000 benchmark; SWAT symptom-based software detector evaluation; crash testing; fault coverage; fault injection; fragile component; gate-level accurate fault injection campaign; gate-level fault evaluation; hardware fault detection solution; hardware reliability solution; microarchitecture-level evaluation; microarchitecture-level simulation; microprocessor core; processor design; silent data corruption rate; software anomaly treatment; software-level symptom monitoring; symptom-based resiliency solution; Circuit faults; Detectors; Field programmable gate arrays; Hardware; Logic gates; Microarchitecture; Software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012
  • Conference_Location
    Dresden
  • ISSN
    1530-1591
  • Print_ISBN
    978-1-4577-2145-8
  • Type

    conf

  • DOI
    10.1109/DATE.2012.6176660
  • Filename
    6176660