DocumentCode :
2446188
Title :
REMEM: REmote MEMory as Checkpointing Storage
Author :
Jin, Hui ; Sun, Xian-He ; Chen, Yong ; Ke, Tao
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
fYear :
2010
fDate :
Nov. 30 2010-Dec. 3 2010
Firstpage :
319
Lastpage :
326
Abstract :
Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. Memory-based checkpointing has been studied extensively in research but has had little success in practice because of its complexity and potential reliability concerns. In this study we present the design and implementation of REMEM, a Remote Memory checkpointing system that extends checkpointing storage from disk to remote memory. A unique feature of REMEM is that it can be integrated seamlessly into existing disk-based checkpointing systems. A user can flexibly switch between REMEM and disk as checkpointing storage to balance efficiency and reliability. The implementation of REMEM on Open MPI is also introduced. The experimental results confirm that REMEM and the proposed adaptive checkpointing storage selection are promising in performance, reliability, and scalability.
Keywords :
application program interfaces; checkpointing; disc storage; fault tolerant computing; message passing; REMEM; checkpointing storage; disk storage; disk-based checkpointing system; fault tolerance; open MPI; remote memory; Checkpointing; Head; Random access memory; Reliability; Sun; Switches; Topology; Checkpointing; Fault Tolerance; High-Performance Computing; Performance; Remote Memory;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on
Conference_Location :
Indianapolis, IN
Print_ISBN :
978-1-4244-9405-7
Electronic_ISBN :
978-0-7695-4302-4
Type :
conf
DOI :
10.1109/CloudCom.2010.102
Filename :
5708466