Title :
Cardio: CMP Adaptation for Reliability Through Dynamic Introspective Operation
Author :
Pellegrini, Alessandro ; Bertacco, Valeria
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Univ. of Michigan, Ann Arbor, MI, USA
Abstract :
A modern digital system integrates many components on a single chip: processing cores, large caches, memory controllers, and hardware accelerators. Looking forward, future semiconductor technologies will enable even higher device integration, increasing overall system performance while reducing energy consumption. Unfortunately, prominent experts agree that such technologies will be prone to both permanent and transient faults within their lifetime. To address this issue, we propose Cardio: a low-cost architecture for reliable chip multiprocessors (CMPs). Our solution is based on a novel hardware/software co-design in which silicon failures are detected in hardware and system reconfiguration is managed in software. Comparing Cardio with a state-of-the-art hardware-based resiliency solution, Immunet, we found that our design achieves a comparable fault response time while requiring a much lower area overhead. The proposed solution relies on a distributed resource manager to collect information about the health of each CMP component, and leverages a synchronized distributed control mechanism to recover from permanent failures. Such an architecture can operate as long as at least one general-purpose processor is still functional. Our experimental evaluation indicates that the overall performance impact of Cardio is as low as 4.5%, and that its dynamic reconfiguration time upon fault detection ranges from 20,000 to 50,000 cycles.
Keywords :
circuit CAD; hardware-software codesign; integrated circuit reliability; multiprocessor interconnection networks; CMP adaptation; Cardio; dynamic introspective operation; energy consumption; hardware accelerators; large caches; low cost architecture; memory controllers; processing cores; reliable chip multiprocessors; synchronized distributed control mechanism; Circuit faults; Hardware; Routing; Runtime; Software; Software reliability; Hardware reliability; modeling techniques; multiprocessor systems; reliability, availability, and serviceability; reliability, testing, and fault-tolerance
Journal_Title :
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on
DOI :
10.1109/TCAD.2013.2284008