DocumentCode
2194307
Title
High Performance Pipelined Process Migration with RDMA
Author
Ouyang, Xiangyong ; Rajachandrasekar, Raghunath ; Besseron, Xavier ; Panda, Dhabaleswar K.
Author_Institution
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear
2011
fDate
23-26 May 2011
Firstpage
314
Lastpage
323
Abstract
Coordinated Checkpoint/Restart (C/R) is a widely deployed strategy to achieve fault-tolerance. However, C/R by itself is not capable enough to meet the demands of upcoming exascale systems, due to its heavy I/O overhead. Process migration has already been proposed in the literature as a proactive fault-tolerance mechanism to complement C/R. Several popular MPI implementations provide support for process migration, including MVAPICH2 and Open MPI, but these existing solutions do not yield satisfactory performance. In this paper we conduct extensive profiling of several process migration mechanisms and reveal that inefficient I/O and network transfer are the principal factors responsible for the high overhead. We then propose a new approach, Pipelined Process Migration with RDMA (PPMR), to overcome these overheads. Our new protocol fully pipelines data writing, data transfer, and data read operations during different phases of a migration cycle. PPMR aggregates data writes on the migration source node and transfers data to the target node via high-throughput RDMA transport. It implements an efficient process restart mechanism at the target node to restart processes from the RDMA data streams. We have implemented this Pipelined Process Migration protocol in MVAPICH2 and studied the performance benefits. Experimental results show that PPMR achieves a 10.7X speedup in completing a process migration over the conventional approach at a moderate (8 MB) memory usage. The process migration overhead on the application is reduced from 38% to 5% by PPMR when three migrations are performed in succession.
Keywords
fault tolerant computing; message passing; pipeline processing; MPI implementations; MVAPICH2; PPMR; RDMA; coordinated checkpoint/restart; data read operations; data transfer; data writing; fault-tolerance; pipelined process migration mechanism; Fault tolerance; Fault tolerant systems; Fuses; Libraries; Protocols; Servers; Writing; RDMA; fault-tolerance; pipelining; process-migration;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium on
Conference_Location
Newport Beach, CA
Print_ISBN
978-1-4577-0129-0
Electronic_ISBN
978-0-7695-4395-6
Type
conf
DOI
10.1109/CCGrid.2011.76
Filename
5948622