• DocumentCode
    3706518
  • Title

    Assessing the Impact of Partial Verifications against Silent Data Corruptions

  • Author

    Aurélien ;Saurabh K. Raina;Yves Robert;Hongyang Sun

  • Author_Institution
    INRIA, Ecole Normale Super. de Lyon, Lyon, France
  • fYear
    2015
  • Firstpage
    440
  • Lastpage
    449
  • Abstract
    Silent errors, or silent data corruptions, constitute a major threat to very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use some low-cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to a first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of using partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
  • Keywords
    "Checkpointing","Protocols","Approximation methods","Redundancy","Analytical models","Performance evaluation","Resilience"
  • Publisher
    ieee
  • Conference_Titel
    Parallel Processing (ICPP), 2015 44th International Conference on
  • ISSN
    0190-3918
  • Type

    conf

  • DOI
    10.1109/ICPP.2015.53
  • Filename
    7349599