• DocumentCode
    2016897
  • Title
    Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing
  • Author
    Fu, Song
  • Author_Institution
    Dept. of Comput. Sci., New Mexico Inst. of Min. & Technol., Socorro, NM
  • fYear
    2009
  • fDate
    18-21 May 2009
  • Firstpage
    372
  • Lastpage
    379
  • Abstract
    In large-scale clusters and computational grids, component failures are the norm rather than the exception. Failure occurrence and its impact on system performance and operation costs have become an increasingly important concern for system designers and administrators. In this paper, we study how to use system resources efficiently in high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability status. We consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and propose best-fit algorithms to find the best-qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments using failure traces from production clusters and the NAS Parallel Benchmark programs on a real cluster. The results show that the proposed strategies enhance system productivity and dependability. With the best-fit strategies, the job completion rate increases by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
  • Keywords
    grid computing; software performance evaluation; software reliability; virtual machines; NAS parallel benchmark programs; failure management techniques; failure-aware construction; high availability computing; reconfigurable distributed virtual machine infrastructure; system designers; Availability; Clustering algorithms; Costs; Distributed computing; Grid computing; Large-scale systems; System performance; Virtual machining; Virtual manufacturing; Voice mail; Distributed virtual machines; Failure-aware resource management; High availability computing; System reconfiguration
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2009. CCGRID '09. 9th IEEE/ACM International Symposium on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4244-3935-5
  • Electronic_ISBN
    978-0-7695-3622-4
  • Type
    conf
  • DOI
    10.1109/CCGRID.2009.21
  • Filename
    5071894