• DocumentCode
    692889
  • Title

    Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

  • Author

    Dong Li ; Zizhong Chen ; Panruo Wu ; Jeffrey S. Vetter

  • Author_Institution
    Oak Ridge National Laboratory, Oak Ridge, TN, USA
  • fYear
    2013
  • fDate
    17-22 Nov. 2013
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by both ABFT and hardware, which leads to redundant costs in performance and energy. In this paper, we rethink ABFT using an integrated view of both software and hardware, with the goal of improving the performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% of system energy (and up to 40% of dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
  • Keywords
    data structures; error correction codes; fault tolerant computing; performance evaluation; power aware computing; ABFT-enabled applications; ECC; algorithm-based fault tolerance; cooperative software-hardware approach; energy efficiency; error-correcting code; hardware resilience mechanisms; resilience ecosystem; rethinking algorithm-based fault tolerance; widely-used scientific computing kernels; Computer architecture; Fault tolerance; Fault tolerant systems; Hardware; Registers; Resilience; adaptive resilience
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • Conference_Location
    Denver, CO
  • Print_ISBN
    978-1-4503-2378-9
  • Type

    conf

  • DOI
    10.1145/2503210.2503226
  • Filename
    6877477