• DocumentCode
    181992
  • Title
    Rethread: A Low-Cost Transient Fault Recovery Scheme for Multithreaded Processors

  • Author
    Jian Fu ; Qiang Yang ; Poss, Raphael ; Jesshope, Chris R. ; Chunyuan Zhang

  • Author_Institution
    Inf. Inst., Univ. of Amsterdam, Amsterdam, Netherlands
  • fYear
    2014
  • fDate
    8-12 Sept. 2014
  • Firstpage
    88
  • Lastpage
    93
  • Abstract
    Transient fault recovery is important for processor availability. However, existing techniques incur significant silicon or performance overheads. We uncover an opportunity to reduce these overheads dramatically in modern processors, one that appears as a side effect of introducing hardware multithreading to improve performance. We observe that threads are usually short code sequences with no branches and few memory side effects, which means that the number of checkpoints is small and constant. In addition, the state structures of a thread already present in hardware can be reused to provide checkpointing. In this paper, we demonstrate this principle using a hardware/software co-design called Rethread, which features compiler-generated code annotations and automatic recovery in hardware by restarting threads. This approach provides the ability to recover from transient faults without dedicated hardware. Moreover, results show the performance degradation both under fault-free conditions (less than 5%) and as a function of fault rate.
  • Keywords
    checkpointing; fault tolerant computing; multi-threading; program compilers; Rethread; automatic recovery; check pointing; compiler-generated code annotations; fault-free condition; hardware multithreading; low-cost transient fault recovery scheme; modern processors; multithreaded processors; performance degradation; short code sequences; silicon; state structures; Bit error rate; Fault detection; Hardware; Instruction sets; Message systems; Transient analysis; fault recovery; multithreading; thread re-execution; transient fault;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Availability, Reliability and Security (ARES), 2014 Ninth International Conference on
  • Conference_Location
    Fribourg
  • Type
    conf
  • DOI
    10.1109/ARES.2014.18
  • Filename
    6980267