• DocumentCode
    1472353
  • Title

    ERSA: Error Resilient System Architecture for Probabilistic Applications

  • Author

    Cho, Hyungmin ; Leem, Larkhoon ; Mitra, Subhasish

  • Author_Institution
    Dept. of Electr. Eng., Stanford Univ., Stanford, CA, USA
  • Volume
    31
  • Issue
    4
  • fYear
    2012
  • fDate
    4/1/2012 12:00:00 AM
  • Firstpage
    546
  • Lastpage
    558
  • Abstract
    There is a growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a robust system architecture which targets emerging killer applications such as recognition, mining, and synthesis (RMS) with inherent error resilience, and ensures high degrees of resilience at low cost. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs). While resilience of RMS applications to errors in low-order bits of data is well-known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control flow errors (in addition to low-order bit errors) using a judicious combination of the following key ideas: 1) asymmetric reliability in many-core architectures; 2) error-resilient algorithms at the core of probabilistic applications; and 3) intelligent software optimizations. Error injection experiments on a multicore ERSA hardware prototype demonstrate that, even at very high error rates of 20 errors/flip-flop/108 cycles (equivalent to 25000 errors/core/s), ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding, and Bayesian network inference. In addition, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges related to SRAM Vccmin problems and erratic bit errors.
  • Keywords
    SRAM chips; integrated circuit reliability; probability; Bayesian network inference; ERSA; LDPC decoding; SRAM; asymmetric reliability; configurable reliability; control flow errors; error injection; error resilient system architecture algorithm; high-order bit errors; intelligent software optimizations; k-means clustering; many-core architectures; multicore ERSA hardware prototype; redundancy techniques; robust system architecture; static memory errors; Computer architecture; Hardware; Instruction sets; Probabilistic logic; Reliability engineering; Resilience; Error resilience; hardware errors; probabilistic applications; reliability; robust systems;
  • fLanguage
    English
  • Journal_Title
    Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    0278-0070
  • Type

    jour

  • DOI
    10.1109/TCAD.2011.2179038
  • Filename
    6171059