• DocumentCode
    2013872
  • Title

    Transient Fault Tolerance on Chip Multiprocessor Based on Dual and Triple Core Redundancy

  • Author

    Gong, Rui ; Dai, Kui ; Wang, Zhiying

  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2008
  • fDate
    15-17 Dec. 2008
  • Firstpage
    273
  • Lastpage
    280
  • Abstract
    To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed that exploit the core redundancy of chip multiprocessors (CMPs). In these core-redundancy-based techniques, however, inter-core communication becomes critical. To reduce the inter-core communication bandwidth demand, this paper proposes two new fault tolerance approaches: dual core redundancy (DCR) and triple core redundancy (TCR). In DCR, only store instructions are compared before commit, which largely reduces the bandwidth demand, and fault recovery is achieved by context saving and recovery. TCR applies triple modular redundancy (TMR) at the core level to efficiently exploit the core resources of CMPs for transient fault masking. In TCR, only the results of store instructions are compared to detect transient faults, which reduces the inter-core communication bandwidth demand. Once a single event upset (SEU) is detected, TCR can be reconfigured to continue executing on the two uncorrupted cores for fault detection. The experimental results demonstrate that, compared with the traditional transient fault recovery scheme CRTR, both DCR and TCR efficiently reduce inter-core bandwidth demand. DCR achieves transient fault recovery with a reasonable performance overhead caused by context saving. TCR occupies more core resources and has the lowest performance overhead during normal execution.
  • Keywords
    fault tolerance; microprocessor chips; redundancy; Transient fault tolerance; chip multiprocessors; dual core redundancy; fault detection; intercore communications; triple core redundancy; triple modular redundancy; Bandwidth; Cathode ray tubes; Context; Event detection; Fault detection; Fault tolerance; Microprocessors; Redundancy; Single event upset; Yarn; Chip Multiprocessor; Dual Core Redundancy; Transient Fault Tolerance; Triple Core Redundancy;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing, 2008. PRDC '08. 14th IEEE Pacific Rim International Symposium on
  • Conference_Location
    Taipei
  • Print_ISBN
    978-0-7695-3448-0
  • Electronic_ISBN
    978-0-7695-3448-0
  • Type

    conf

  • DOI
    10.1109/PRDC.2008.40
  • Filename
    4725306