• DocumentCode
    2978905
  • Title
    Design Trade-Offs and Deadlock Prevention in Transient Fault-Tolerant SMT Processors
  • Author
    Li, Xiaobin; Gaudiot, Jean-Luc
  • Author_Institution
    Enterprise Microprocessor Group, Intel Corp.
  • fYear
    2006
  • fDate
    Dec. 2006
  • Firstpage
    315
  • Lastpage
    322
  • Abstract
    Since the very concept of simultaneous multi-threading (SMT) entails inherent redundancy, some proposals have been made to run two copies of the same thread on SMT platforms in order to detect and correct soft errors. Upon detection of an error, the processor state is rolled back to a known safe point and the instructions are retried, resulting in a completely error-free execution. This paper focuses on two crucial implementation issues introduced by this concept: (i) the design trade-off between fault detection coverage and design cost; (ii) the possible occurrence of deadlock situations. To achieve the largest possible fault detection coverage, we replicate the fetched instructions in order to generate the redundant thread copies. Further, we apply SMT thread scheduling at the instruction dispatch stage so as to lower the performance overhead. Our simulation results show that, relative to the baseline processor, our two new schemes reduce the performance overhead to as little as 34% on average, down from 42%. Finally, in the fault-tolerant execution mode, since the two copied threads cooperate with one another, deadlock situations could be quite common. We thus present a detailed deadlock analysis and conclude that allocating some entries of the ROB, LQ, and SQ for the trailing thread is sufficient to prevent such deadlocks.
  • Keywords
    fault tolerant computing; multi-threading; processor scheduling; resource allocation; system recovery; SMT thread scheduling; deadlock prevention; simultaneous multi-threading processor; trade-off design; transient fault-tolerant; Costs; Error correction; Fault detection; Fault tolerance; Processor scheduling; Proposals; Redundancy; Surface-mount technology; System recovery; Yarn
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing, 2006. PRDC '06. 12th Pacific Rim International Symposium on
  • Conference_Location
    Riverside, CA
  • Print_ISBN
    0-7695-2724-8
  • Type
    conf
  • DOI
    10.1109/PRDC.2006.25
  • Filename
    4041917