• DocumentCode
    3077358
  • Title
    An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications
  • Author
    Di, Sheng ; Berrocal, Eduardo ; Cappello, Franck
  • Author_Institution
    Argonne Nat. Lab., Argonne, IL, USA
  • fYear
    2015
  • fDate
    4-7 May 2015
  • Firstpage
    271
  • Lastpage
    280
  • Abstract
    The silent data corruption (SDC) problem is attracting increasing attention because it is expected to have a great impact on exascale HPC applications. SDC faults are hazardous in that they pass unnoticed by hardware and can lead to wrong computation results. In this work, we formulate SDC detection as a runtime one-step-ahead prediction method, leveraging multiple linear prediction methods in order to improve the detection results. The contributions are twofold: (1) we propose an error feedback control model that can reduce the prediction errors for different linear prediction methods, and (2) we propose a spatial-data-based even-sampling method to minimize the detection overhead (including memory and computation cost). We implement our algorithms in the Fault Tolerance Interface (FTI), a fault tolerance library with multiple checkpoint levels, such that users can conveniently protect their HPC applications against both SDC errors and fail-stop errors. We evaluate our approach by using large-scale traces from well-known, large-scale HPC applications, as well as by running those HPC applications in a real cluster environment. Experiments show that our error feedback control model can improve detection sensitivity by 34-189% for bit-flip memory errors injected with the bit positions in the range [20,30], without any degradation in detection accuracy. Furthermore, memory size can be reduced by 33% with our spatial-data even-sampling method, with only a slight and graceful degradation in the detection sensitivity.
  • Keywords
    fault tolerant computing; parallel processing; performance evaluation; sampling methods; SDC detection; bit-flip memory errors; detection sensitivity; error feedback control model; error-feedback control; even sampling; exascale HPC applications; fail-stop errors; fault tolerance interface; fault tolerance library; graceful degradation; large-scale HPC applications; linear prediction methods; multiple checkpoint levels; multiple linear prediction methods; runtime one-step-ahead prediction method; silent data corruption detection method; spatial-data even-sampling method; spatial-data-based even-sampling method; Accuracy; Computational modeling; Correlation; Detectors; Feedback control; Predictive models; fault tolerance; silent data corruption;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Type
    conf
  • DOI
    10.1109/CCGrid.2015.17
  • Filename
    7152493