• DocumentCode
    579758
  • Title
    VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters
  • Author
    Lu, Peng ; Ravindran, Binoy ; Kim, Changsoo
  • Author_Institution
    ECE Dept., Virginia Tech, Blacksburg, VA, USA
  • fYear
    2012
  • fDate
    24-26 Oct. 2012
  • Firstpage
    203
  • Lastpage
    210
  • Abstract
    A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, interconnected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Checkpointing (VPC), a lightweight, globally consistent checkpointing mechanism that checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each checkpointing interval, VPC reduces the solo VM downtime further than traditional incremental checkpointing approaches. In addition, VPC uses a globally consistent checkpointing algorithm that preserves the global consistency of the VMs' execution and communication states, and saves only the memory pages updated during each checkpointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC checkpointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45% under the NPB benchmark, and reduces the entire VC downtime by as much as 50% under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.
  • Keywords
    checkpointing; fault tolerant computing; protocols; virtual machines; workstation clusters; NPB benchmark; NPB distributed program; VC downtime; VC migration solution; VM communication state; VM execution state; VM failure; VNsnap; VPC; checkpointing interval; checkpoint-caused page fault; fault-tolerant mechanism; fault-tolerant protocol; global consistency; globally consistent checkpointing algorithm; memory overhead; performance overhead; physical host; scalable low downtime checkpointing; solo VM downtime; system restoration; updated memory page saving; virtual clusters; virtual machines; virtual network interconnection; virtual predict checkpointing; Checkpointing; Fault tolerance; Hardware; Memory management; Nonvolatile memory; Random access memory; Servers; Checkpointing; Prediction; Virtual Machine
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture and High Performance Computing (SBAC-PAD), 2012 IEEE 24th International Symposium on
  • Conference_Location
    New York, NY
  • ISSN
    1550-6533
  • Print_ISBN
    978-1-4673-4790-7
  • Type
    conf
  • DOI
    10.1109/SBAC-PAD.2012.31
  • Filename
    6374790