DocumentCode
107984
Title
Relyzer: Application Resiliency Analyzer for Transient Faults
Author
Sastry Hari, Siva Kumar ; Adve, Sarita V. ; Naeimi, Helia ; Ramachandran, Prasadh
Author_Institution
Univ. of Illinois at Urbana-Champaign, Champaign, IL, USA
Volume
33
Issue
3
fYear
2013
fDate
May-June 2013
Firstpage
58
Lastpage
66
Abstract
Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
Keywords
error detection; program diagnostics; software fault tolerance; software reliability; Relyzer; SDC-causing program locations; SDC-vulnerable program locations; application resiliency analyzer; application resiliency evaluation; application-level fault sites; error-resilient programming models; failure-prone devices; fault injections; fault-pruning techniques; hardware fault detection; low-cost application-centric resiliency solutions; low-cost software-level symptom monitoring; microprocessors; silent data corruptions; transient faults; Computer architecture; Computer programs; Costs; Fault diagnosis; Hardware; low-cost hardware resiliency; silent data corruption
fLanguage
English
Journal_Title
Micro, IEEE
Publisher
IEEE
ISSN
0272-1732
Type
jour
DOI
10.1109/MM.2013.30
Filename
6487478