Title :
REMEDIATE: A scalable fault-tolerant architecture for low-power NUCA cache in tiled CMPs
Author :
BanaiyanMofrad, Abbas ; Homayoun, Houman ; Kontorinis, Vasileios ; Tullsen, Dean ; Dutt, Nikil
Author_Institution :
Dept. of Comput. Sci., Univ. of California, Irvine, Irvine, CA, USA
Abstract :
Technology scaling and process variation severely degrade the reliability of Chip Multiprocessors (CMPs), especially their large cache blocks. To improve cache reliability, we propose REMEDIATE, a scalable fault-tolerant architecture for low-power design of shared Non-Uniform Cache Access (NUCA) cache in Tiled CMPs. REMEDIATE achieves fault tolerance through redundancy from multiple banks to maximize the amount of fault remapping and minimize the amount of capacity lost in the cache when the failure rate is high. REMEDIATE leverages a scalable fault protection technique using two different remapping heuristics in a distributed shared cache architecture with non-uniform latencies. We deploy a graph coloring algorithm to optimize REMEDIATE's remapping configuration. We perform an extensive design space exploration of operating voltage, performance, and power that enables designers to select different operating points and evaluate their design efficacy. Experimental results on a 4×4 tiled CMP system voltage-scaled to below 400 mV show that REMEDIATE saves up to 50% power while recovering more than 80% of the faulty cache area with only modest performance degradation.
Keywords :
cache storage; distributed shared memory systems; fault tolerant computing; graph colouring; multiprocessing systems; power aware computing; system recovery; REMEDIATE's remapping configuration; cache reliability improvement; chip multiprocessors; distributed shared cache architecture; failure rate; fault protection technique; fault remapping; graph coloring algorithm; low-power NUCA cache; nonuniform latencies; operating voltage; performance; redundancy; remapping heuristics; scalable fault-tolerant architecture; shared nonuniform cache access; tiled CMP system; Blogs; Degradation; Fault tolerance; Fault tolerant systems; Hardware; Maintenance engineering; Aggressive voltage scaling; Fault-tolerant cache; Remapping;
Conference_Title :
Green Computing Conference (IGCC), 2013 International
Conference_Location :
Arlington, VA
DOI :
10.1109/IGCC.2013.6604500