Title :
DFHR: A Design Framework for HPC Reliability
Author_Institution :
Jiangnan Inst. of Comput. Technol., Wuxi, China
Abstract :
This paper proposes a design framework for HPC reliability (DFHR), which consists of reliability prediction, failure hotspot (FH) exploration, reliability program planning, and reliability plan implementation. In DFHR, 1) the concept of a failure hotspot is proposed, which provides targets for reliability engineering; 2) a unified reliability design process is included in the framework, with consistent regard for the costs of reliability techniques; 3) the engineering practicability of the framework is achieved through hierarchical reliability planning and reasonable approximation; 4) highly precise reliability prediction methods are included in the framework. Based on the results of applying DFHR to the reliability design of a high performance computer, we demonstrate that DFHR is suitable for the cost-effective reliability design of HPC systems.
Keywords :
DRAM chips; performance evaluation; reliability; DRAM chip; HPC reliability; high performance computer; reliability prediction; reliability program planning; reliability technique costs; Automatic testing; Availability; Design engineering; Hardware; High performance computing; Process design; Reliability engineering; Sockets; Supercomputers; System testing; reliability design
Conference_Title :
Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-0-7695-3932-4
Electronic_ISBN :
978-1-4244-5467-9
DOI :
10.1109/FCST.2009.46