DocumentCode :
1446842
Title :
PERFECTORY: A Fault-Tolerant Directory Memory Architecture
Author :
Lee, Hyunjin ; Cho, Sangyeun ; Childers, Bruce R.
Author_Institution :
Dept. of Comput. Sci., Univ. of Pittsburgh, Pittsburgh, PA, USA
Volume :
59
Issue :
5
fYear :
2010
fDate :
5/1/2010
Firstpage :
638
Lastpage :
650
Abstract :
The number of CPUs in chip multiprocessors is growing at the Moore's Law rate, due to continued technology advances. However, new technologies pose serious reliability challenges, such as more frequent occurrences of degraded or even nonoperational devices, and they threaten the cost-effectiveness and dependability of future computing systems. This work studies how to protect the on-chip coherence directory from faults. In a chip multiprocessor, cache coherence mechanisms such as directory memory are critical for offering a consistent view of data to all CPUs. We propose a novel online fault detection and correction scheme to enhance yield and resilience to runtime errors at a small performance cost. The proposed scheme uses smart encoding and coherence protocol adaptation strategies to salvage faulty directory entries. We also develop an online error recovery scheme that protects the directory memory from soft errors. We call our fault-tolerant directory memory architecture PERFECTORY. Evaluation results show that PERFECTORY achieves very high fault resilience: over 99 percent chip yield at 0.05 percent hard error ratio and 1,934 years MTTF at 1,000 FIT using a 100-processor cluster configuration. PERFECTORY limits performance degradation to less than 1 percent at 0.05 percent hard error ratio and requires significantly smaller area overheads than existing redundancy approaches.
Keywords :
cache storage; encoding; fault tolerant computing; memory architecture; multiprocessing systems; protocols; system recovery; Moore Law rate; cache coherence mechanisms; chip multiprocessors; chip yield; coherence protocol adaptation strategy; fault correction scheme; fault detection; fault tolerant directory memory architecture; online error recovery scheme; Costs; Degradation; Error correction; Fault detection; Fault tolerance; Memory architecture; Moore's Law; Protection; Resilience; Runtime; Chip multiprocessor; cache coherence; lifetime reliability
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2009.138
Filename :
5255232