Title :
VirtCFT: A Transparent VM-Level Fault-Tolerant System for Virtual Clusters
Author :
Zhang, Minjia ; Jin, Hai ; Shi, Xuanhua ; Wu, Song
Author_Institution :
Services Comput. Technol. & Syst. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China
Abstract :
A virtual cluster consists of a multitude of virtual machines and software components, all of which will eventually fail. In many environments, such failures can result in unanticipated, potentially devastating failure behavior and in service unavailability. Failover capability is essential to the virtual cluster's availability, reliability, and manageability. Most existing methods share two disadvantages: they require modifications to the target processes or their OSes, which is usually error-prone and sometimes impractical; and they checkpoint only processes rather than entire OS images, which limits where they can be applied. In this paper we present VirtCFT, an innovative and practical fault tolerance system for virtual clusters. VirtCFT is a system-level, coordinated distributed checkpointing fault-tolerant system. It coordinates the distributed VMs to periodically reach a globally consistent state and takes a checkpoint of the whole virtual cluster, including the CPU, memory, and disk state of each VM as well as the network communications. When faults occur, VirtCFT automatically recovers the entire virtual cluster to the correct state within a few seconds and keeps it running. Superior to all the existing fault tolerance mechanisms, VirtCFT provides a simpler and totally transparent fault-tolerant platform that allows existing, unmodified software and operating systems (regardless of version) to be protected from the failure of the physical machine on which they run. We have implemented this system on the Xen virtualization platform. Our experiments with real-world benchmarks demonstrate the effectiveness and correctness of VirtCFT.
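The coordinated checkpoint-and-recover cycle described in the abstract can be illustrated with a toy, in-memory simulation. This is only a sketch under stated assumptions, not VirtCFT's implementation: `SimulatedVM`, `Coordinator`, and the `in_flight` message list are hypothetical names invented here, no Xen interface is used, and pausing every VM before snapshotting is one simple way to obtain a globally consistent state, which the abstract does not confirm is the method used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimulatedVM:
    """Hypothetical stand-in for one VM; not a Xen domain."""
    name: str
    cpu: int = 0
    memory: Dict[str, int] = field(default_factory=dict)
    disk: Dict[str, int] = field(default_factory=dict)
    paused: bool = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def snapshot(self) -> dict:
        # Only a paused VM has a stable state to copy.
        if not self.paused:
            raise RuntimeError("snapshot requires a paused VM")
        return {"cpu": self.cpu,
                "memory": dict(self.memory),
                "disk": dict(self.disk)}

    def restore(self, state: dict) -> None:
        self.cpu = state["cpu"]
        self.memory = dict(state["memory"])
        self.disk = dict(state["disk"])


class Coordinator:
    """Hypothetical coordinator: pause all, snapshot all, resume all."""

    def __init__(self, vms: List[SimulatedVM]) -> None:
        self.vms = list(vms)
        self.in_flight: List[str] = []  # simulated undelivered messages
        self.checkpoint: Optional[dict] = None

    def take_checkpoint(self) -> None:
        # Phase 1: stop every VM so no state changes during the snapshot.
        for vm in self.vms:
            vm.pause()
        try:
            # Phase 2: record per-VM state plus the network state.
            self.checkpoint = {
                "vms": {vm.name: vm.snapshot() for vm in self.vms},
                "network": list(self.in_flight),
            }
        finally:
            # Phase 3: let the cluster continue running.
            for vm in self.vms:
                vm.resume()

    def recover(self) -> None:
        # Roll the whole cluster back to the last global checkpoint.
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint available")
        for vm in self.vms:
            vm.restore(self.checkpoint["vms"][vm.name])
        self.in_flight = list(self.checkpoint["network"])
```

In this sketch, recovery always rolls back every VM together, which mirrors the abstract's point that the entire virtual cluster, not a single process or machine, is the unit of checkpointing.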
Keywords :
checkpointing; distributed processing; object-oriented programming; operating systems (computers); software fault tolerance; virtual machines; workstation clusters; CPU; OS images; VirtCFT; Xen virtualization platform; checkpoints; coordinated distributed checkpointing fault tolerant system; distributed VM; failure behavior; fault tolerance mechanisms; globally consistent state; innovative system; memory; network communications; operating system; physical machine; practical system; real-world benchmarks; service unavailability; software components; transparent VM-level fault-tolerant system; unmodified software; virtual clusters; virtual machines; Coordinated Checkpointing; Fault Tolerance; High Availability; Virtual Machine;
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-9727-0
Electronic_ISBN :
1521-9097
DOI :
10.1109/ICPADS.2010.125