Title :
Topology-aware reliability optimization for multiprocessor systems
Author :
Jie Meng; Fulya Kaplan; Mingyu Hsieh; Ayse K. Coskun
Author_Institution :
Electrical and Computer Engineering Department, Boston University, MA, USA
Abstract :
High on-chip temperatures adversely affect processor reliability, which has become a serious concern as high-performance computing moves towards exascale. While dynamic thermal management techniques can effectively constrain chip temperature, most prior work has focused on temperature and reliability optimization of a single processor. In this work, we propose a topology-aware workload allocation policy that optimizes the reliability of multi-chip multicore systems at runtime. Our results show that the proposed policy improves system reliability by up to 123.3% compared to existing temperature balancing policies when systems have medium to high utilization. We also demonstrate that the policy scales to larger systems and that its performance overhead is minimal.
Keywords :
"Reliability","Program processors","Multicore processing","Topology","Failure analysis","Resource management","Optimization"
Conference_Title :
VLSI and System-on-Chip, 2012 (VLSI-SoC), IEEE/IFIP 20th International Conference on
Print_ISBN :
978-1-4673-2658-2
DOI :
10.1109/VLSI-SoC.2012.7332108