• DocumentCode
    107984
  • Title

    Relyzer: Application Resiliency Analyzer for Transient Faults

  • Author

    Sastry Hari, Siva Kumar ; Adve, Sarita V. ; Naeimi, Helia ; Ramachandran, Prasadh

  • Author_Institution
    Univ. of Illinois at Urbana-Champaign, Champaign, IL, USA
  • Volume
    33
  • Issue
    3
  • fYear
    2013
  • fDate
    May-June 2013
  • Firstpage
    58
  • Lastpage
    66
  • Abstract
    Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
  • Keywords
    error detection; program diagnostics; software fault tolerance; software reliability; Relyzer; SDC-causing program locations; SDC-vulnerable program locations; application resiliency analyzer; application resiliency evaluation; application-level fault sites; error-resilient programming models; failure-prone devices; fault injections; fault-pruning techniques; hardware fault detection; low-cost application-centric resiliency solutions; low-cost software-level symptom monitoring; microprocessors; silent data corruptions; transient faults; Computer architecture; Computer programs; Costs; Fault diagnosis; Hardware; low-cost hardware resiliency; silent data corruption
  • fLanguage
    English
  • Journal_Title
    IEEE Micro
  • Publisher
    IEEE
  • ISSN
    0272-1732
  • Type

    jour

  • DOI
    10.1109/MM.2013.30
  • Filename
    6487478