Title : 
DFHR: A Design Framework for HPC Reliability
         
        
        
            Author_Institution : 
Jiangnan Inst. of Comput. Technol., Wuxi, China
         
        
        
        
        
        
            Abstract : 
This paper proposes a design framework for HPC reliability (DFHR) that consists of four stages: reliability prediction, failure hotspot (FH) exploration, reliability program planning, and reliability plan implementation. In DFHR, 1) the concept of a failure hotspot is introduced to provide targets for reliability engineering; 2) a unified reliability design process is included that consistently accounts for the cost of reliability techniques; 3) the framework is made practical for engineering use through hierarchical reliability planning and reasonable approximation; 4) highly precise reliability prediction methods are included. Results from applying DFHR to the reliability design of a high performance computer demonstrate that DFHR is suitable for the cost-effective reliability design of HPC systems.
         
        
            Keywords : 
DRAM chips; performance evaluation; reliability; DRAM chip; HPC reliability; high performance computer; reliability prediction; reliability program planning; reliability technique costs; Automatic testing; Availability; Design engineering; Hardware; High performance computing; Process design; Reliability engineering; Sockets; Supercomputers; System testing; reliability design
         
        
        
        
            Conference_Title : 
Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
         
        
            Conference_Location : 
Shanghai
         
        
            Print_ISBN : 
978-0-7695-3932-4
         
        
            Electronic_ISBN : 
978-1-4244-5467-9
         
        
        
            DOI : 
10.1109/FCST.2009.46