• DocumentCode
    642808
  • Title
    Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation
  • Author
    Hao Chen ; Chengmo Yang
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA
  • fYear
    2013
  • fDate
    Sept. 29 2013-Oct. 4 2013
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    The continued scaling-down of feature size and noise margin keeps elevating hardware failure rates, requiring the incorporation of fault tolerance into computer systems. One fault tolerance scheme that has received considerable research attention is redundant execution. However, existing solutions are developed under the assumption that the fault rate is low. These techniques either focus solely on fault detection, or sometimes even increase recovery cost to reduce fault detection overhead. This lack of overall efficiency makes them unsuitable for embedded systems with tight energy and cost budgets. Our study shows that checkpoint frequency and fault rate are two critical parameters determining the overall fault detection and recovery overhead. To co-optimize detection and recovery, we statically construct a mathematical model that takes application and architecture characteristics into consideration and identifies the optimal checkpoint frequency of an application for a given fault rate. Moreover, as the fault rate cannot be predicted a priori, we propose a set of heuristics that enable the system to dynamically monitor the fault rate and adapt the checkpoint frequency accordingly. The efficacy of the static and the adaptive optimizations is evaluated through detailed instruction-level simulation. The results show that the optimal checkpoint frequency identified by the static model is very close to the actual value (6% deviation) and that the run-time adaptation scheme effectively reduces the overhead caused by the unpredictability of the fault rate.
  • Keywords
    checkpointing; fault tolerant computing; program compilers; program diagnostics; checkpoint frequency; compile-time analysis; embedded systems; fault detection efficiency co-optimization; fault detection overhead; fault rate; fault recovery efficiency co-optimization; fault recovery overhead; fault tolerance; feature size; instruction-level simulation; noise margin; redundant execution; runtime adaptation; Adaptation models; Checkpointing; Fault detection; Fault tolerant systems; Mathematical model; Registers; Runtime;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2013 International Conference on
  • Conference_Location
    Montreal, QC
  • Type
    conf
  • DOI
    10.1109/CASES.2013.6662528
  • Filename
    6662528