DocumentCode :
2887811
Title :
Transparent system-level migration of PGAS applications using Xen on InfiniBand
Author :
Scarpazza, D.P. ; Mullaney, P. ; Villa, O. ; Petrini, F. ; Tipparaju, V. ; Brown, D. M L, Jr. ; Nieplocha, J.
Author_Institution :
Pacific Northwest Nat. Lab., Richland, WA
fYear :
2007
fDate :
17-20 Sept. 2007
Firstpage :
74
Lastpage :
83
Abstract :
Checkpoint-restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While experience has focused attention on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMWare, has opened the door to the possibility of efficient and scalable cluster-level virtualization. In this paper we present an innovative approach to cluster fault tolerance by integrating Xen virtualization with the latest generation of the InfiniBand network. A major contribution of this approach is the automatic identification of global recovery lines to freeze the status of the machine. Our focus is on the partitioned global address space (PGAS) programming models, which have been receiving an increasing amount of attention in recent years. We have developed a global coordination mechanism and deployed it in the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library, which has been used as a run-time system for several PGAS languages and libraries. The experimental results show that it is possible to virtualize communication and computation with minimal overhead and to provide seamless migration capabilities.
Keywords :
checkpointing; fault tolerant computing; virtual machines; workstation clusters; ARMCI one-sided communication library; InfiniBand network; PGAS programming model; Xen virtual machine; aggregate remote memory copy interface; checkpoint-restart approach; global recovery line identification; high-performance cluster fault-tolerance; partitioned global address space; run-time system; system-level virtualization software; transparent system-level migration; Concurrent computing; Costs; Electronics packaging; Fault tolerance; Fault tolerant systems; Laboratories; Large-scale systems; Libraries; Programming profession; Supercomputers;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing, 2007 IEEE International Conference on
Conference_Location :
Austin, TX
ISSN :
1552-5244
Print_ISBN :
978-1-4244-1387-4
Electronic_ISBN :
1552-5244
Type :
conf
DOI :
10.1109/CLUSTR.2007.4629219
Filename :
4629219