Title :
Process variation and temperature-aware reliability management
Author :
Zhuo, Cheng ; Sylvester, Dennis ; Blaauw, David
Author_Institution :
EECS Dept., Univ. of Michigan, Ann Arbor, MI, USA
Abstract :
In aggressively scaled technologies, reliability concerns such as oxide breakdown have become a key issue. Dynamic reliability management (DRM) has been proposed as a mechanism to dynamically explore the trade-off between system performance and reliability margin. However, existing DRM methods are hampered by the fact that they do not accurately model spatial and temporal variations in process and temperature parameters, which have a strong impact on chip reliability. In addition, they make the simplifying assumption that future workloads are identical to the currently observed one. This makes them sensitive to sudden workload variations and outliers. In this paper, we present a novel workload-aware dynamic reliability management framework that accounts for local variations in both process and temperature. The reliability estimation, along with the predicted remaining workload, is fed to a dynamic voltage/frequency scaling module to manage the system reliability and optimize processor performance. Using a fast on-line analytical/table-look-up method, we demonstrate an average error of 1% with up to 5 orders of magnitude speedup compared to Monte Carlo simulation. Experiments on an Alpha-like processor show that our DRM framework fully utilizes the available margin and achieves a 28.7% performance improvement on average.
Keywords :
Monte Carlo methods; integrated circuit reliability; microprocessor chips; power aware computing; Monte Carlo simulation; alpha-like processor; chip reliability; dynamic reliability management; dynamic voltage scaling module; frequency scaling module; process variation; processor performance; reliability estimation; temperature parameters; temperature-aware reliability management; Control systems; Dynamic voltage scaling; Electric breakdown; Failure analysis; Frequency estimation; Power system management; Power system reliability; System performance; Temperature sensors; Voltage control;
Conference_Title :
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010
Conference_Location :
Dresden
Print_ISBN :
978-1-4244-7054-9
DOI :
10.1109/DATE.2010.5457139