Title :
CrashTest'ing SWAT: Accurate, gate-level evaluation of symptom-based resiliency solutions
Author :
Pellegrini, A. ; Smolinski, R. ; Chen, L. ; Fu, X. ; Hari, S.K.S. ; Jiang, J. ; Adve, S.V. ; Austin, T. ; Bertacco, V.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Univ. of Michigan, Ann Arbor, MI, USA
Abstract :
Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently, researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.
Keywords :
circuit reliability; fault diagnosis; field programmable gate arrays; logic gates; microprocessor chips; microsensors; network synthesis; FPGA-based full-system prototype; OpenSPARC T1 core logic; SDC rate; SPECInt 2000 benchmark; SWAT symptom-based software detector evaluation; crash testing; fault coverage; fault injection; fragile component; gate-level accurate fault injection campaign; gate-level fault evaluation; hardware fault detection solution; hardware reliability solution; microarchitecture-level evaluation; microarchitecture-level simulation; microprocessor core; processor design; silent data corruption rate; software anomaly treatment; software-level symptom monitoring; symptom-based resiliency solution; Circuit faults; Detectors; Field programmable gate arrays; Hardware; Logic gates; Microarchitecture; Software
Conference_Title :
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012
Conference_Location :
Dresden
Print_ISBN :
978-1-4577-2145-8
DOI :
10.1109/DATE.2012.6176660