• DocumentCode
    2177745
  • Title

    Analyzing Reliability of Memory Sub-systems with Double-Chipkill Detect/Correct

  • Author

    Jian, Xun ; DeBardeleben, Nathan ; Blanchard, Sean ; Sridharan, Vilas ; Kumar, Ravindra

  • Author_Institution
    Univ. of Illinois at Urbana Champaign, Urbana, IL, USA
  • fYear
    2013
  • fDate
    2-4 Dec. 2013
  • Firstpage
    88
  • Lastpage
    97
  • Abstract
    Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that guarantee the detection, and even correction, of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the outputs of our analytical models are within 3.9% of the simulation results, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into the dependence of the reliability of these systems on scale, device fault rate, memory organization, and memory-scrubbing policy.
  • Keywords
    DRAM chips; Monte Carlo methods; error correction codes; error detection codes; integrated circuit reliability; HPC systems; Monte Carlo simulations; analytical reliability models; device fault rate; double-chipkill correct solutions; double-chipkill detect reliability; memory organization; memory sub-system reliability analysis; memory-scrubbing policy; single DRAM device; systems on scale; Analytical models; Error correction codes; Monte Carlo methods; Organizations; Random access memory; Reliability; Transient analysis; chipkill correct; error correcting codes; memory errors; modeling; reliability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing (PRDC), 2013 IEEE 19th Pacific Rim International Symposium on
  • Conference_Location
    Vancouver, BC
  • Type

    conf

  • DOI
    10.1109/PRDC.2013.18
  • Filename
    6820844