• DocumentCode
    3208288
  • Title

    DFHR: A Design Framework for HPC Reliability

  • Author

    Huang, Yongqin

  • Author_Institution
    Jiangnan Inst. of Comput. Technol., Wuxi, China
  • fYear
    2009
  • fDate
    17-19 Dec. 2009
  • Firstpage
    655
  • Lastpage
    660
  • Abstract
    This paper proposes a design framework for HPC reliability (DFHR) that consists of reliability prediction, failure hotspot (FH) exploration, reliability program planning, and reliability plan implementation. In DFHR, 1) the concept of a failure hotspot is proposed, which provides targets for reliability engineering; 2) a unified reliability design process is included in the framework, with consistent attention to the costs of reliability techniques; 3) the engineering practicability of the framework is achieved through hierarchical reliability planning and reasonable approximation; 4) highly precise reliability prediction methods are included in the framework. Based on the results of applying DFHR to the reliability design of a high performance computer, we demonstrate that DFHR is suitable for the cost-effective reliability design of HPC systems.
  • Keywords
    DRAM chips; performance evaluation; reliability; DRAM chip; HPC reliability; high performance computer; reliability prediction; reliability program planning; reliability technique costs; Automatic testing; Availability; Design engineering; Hardware; High performance computing; Process design; Reliability engineering; Sockets; Supercomputers; System testing; reliability design
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-0-7695-3932-4
  • Electronic_ISBN
    978-1-4244-5467-9
  • Type

    conf

  • DOI
    10.1109/FCST.2009.46
  • Filename
    5392846