Title :
Balancing reliability, cost, and performance tradeoffs with FreeFault
Author :
Kim, Dong Wan; Erez, Mattan
Author_Institution :
Electr. & Comput. Eng. Dept., Univ. of Texas at Austin, Austin, TX, USA
Abstract :
Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with the advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping-type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.
Keywords :
DRAM chips; cache storage; fault tolerant computing; integrated circuit reliability; performance evaluation; DRAM faults; FreeFault; fault rates; fault tolerance mechanisms; hardware memory scrubber; last-level cache; memory capacity; memory errors; memory health; reliability-cost-performance tradeoff balancing; retired memory regions; retirement decisions; system failures; tradeoff decisions; Error correction codes; Hardware; Maintenance engineering; Memory management; Random access memory; Retirement; Software
Conference_Title :
High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on
Conference_Location :
Burlingame, CA
DOI :
10.1109/HPCA.2015.7056053