• DocumentCode
    2544276
  • Title

    A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud

  • Author

    Egwutuoha, Ifeanyi P.; Chen, Shiping; Levy, David; Selic, Bran; Calvo, Rodrigo

  • Author_Institution
    Sch. of Electr. & Inf. Eng., Univ. of Sydney, Sydney, NSW, Australia
  • fYear
    2012
  • fDate
    1-3 Nov. 2012
  • Firstpage
    268
  • Lastpage
    273
  • Abstract
    Cloud computing offers new computing paradigms, capacity, and flexibility to high performance computing (HPC) applications by provisioning a large number of Virtual Machines (VMs) for computation-intensive applications under the Hardware as a Service (HaaS) model. However, because of the large number of VMs and electronic components in HPC systems in the cloud, any fault during execution would result in re-running the application, which costs time, money, and energy. In this paper we present a proactive Fault Tolerance (FT) approach to HPC systems in the cloud to reduce the wall clock execution time in the presence of faults. We develop a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We analyze the dollar cost of provisioning spare nodes to assess the value of our approach. Our experimental results, obtained from a real cloud execution environment, show that the wall clock execution time of computation-intensive applications in the cloud can be reduced by as much as 30%. The frequency of checkpointing of computation-intensive applications can be reduced to 50% with our fault tolerance approach for HPC in the cloud, compared to current FT approaches.
  • Keywords
    checkpointing; cloud computing; fault tolerant computing; parallel processing; virtual machines; HPC; HaaS model; VM; cloud execution environment; computing paradigm; generic FT algorithm; hardware-as-a-service model; high performance computing; proactive fault tolerance approach; spare node; wall clock execution time; Fault tolerance; Fault tolerant systems; Hardware; Monitoring; Program processors; Temperature measurement; Temperature sensors; HaaS; Proactive Fault tolerance; computation-intensive application
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud and Green Computing (CGC), 2012 Second International Conference on
  • Conference_Location
    Xiangtan
  • Print_ISBN
    978-1-4673-3027-5
  • Type

    conf

  • DOI
    10.1109/CGC.2012.22
  • Filename
    6382828