Title :
Checkpoint-restart for a network of virtual machines
Author :
Garg, Radhika ; Sodha, Komal ; Jin, Zhengping ; Cooperman, Gene
Author_Institution :
Northeastern Univ., Boston, MA, USA
Abstract :
The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.
Keywords :
checkpointing; multi-threading; virtual machines; BLAST bioinformatics software; Btrfs filesystem; DMTCP checkpoint-restart package; GNU parallel; IPython.parallel; KVM/QEMU; MATLAB; QEMU; R statistical modelling language; Simulink; common productivity software; forked checkpointing; incremental Btrfs-based snapshots; mmap-based restart; Lguest; parallel R; parallel blast.py; parallel shells; virtual filesystem; virtual machine network; Checkpointing; Computers; Image restoration; Kernel; Optimization; Virtual machining
Conference_Title :
Cluster Computing (CLUSTER), 2013 IEEE International Conference on
Conference_Location :
Indianapolis, IN
DOI :
10.1109/CLUSTER.2013.6702626