DocumentCode :
3143948
Title :
HadoopRsync
Author :
Zhang, Jiaran ; Yu, Xiaohui ; Li, You ; Lin, Liwei
Author_Institution :
Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan, China
fYear :
2011
fDate :
12-14 Dec. 2011
Firstpage :
166
Lastpage :
173
Abstract :
Cloud storage has become increasingly popular due to its convenience, cost-effectiveness and scalability. It provides the basis for a slate of file hosting services, which offer users the ability to synchronize their files between the servers and their devices. Naive file synchronization, however, requires the whole file to be transmitted to all other locations (servers, devices) whenever the file is updated in one location. This leads to a massive waste of bandwidth and significant delays in propagating the update. We propose a method called HadoopRsync, which is capable of performing incremental updates of files instead of transmitting them in their entirety. This method is based on the rsync utility originally proposed for file synchronization between computers, but the scenario under consideration differs significantly from that of rsync: in the cloud storage context, files are stored in a distributed fashion across multiple nodes in the cloud. We therefore propose a pair of algorithms called HadoopRsync Upload and HadoopRsync Download, which are responsible for the synchronization from the user's devices to the cloud and the synchronization in the opposite direction, respectively. These algorithms transmit only the differences between the new version of the file and the old version, rather than the whole file. Our solution is based on Hadoop, the open-source framework for distributed processing of very large data across clusters of computers. The algorithms utilize the MapReduce facility provided by Hadoop to take full advantage of its massive-parallelization capability. In addition, we propose some optimization measures to reduce the I/Os required for file update. Extensive experiments are conducted to evaluate the proposed solution, which show that HadoopRsync significantly outperforms the baseline methods.
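Illustrative note (not part of the original record): the idea of transmitting only the differences between file versions can be sketched in a few lines of Python. This is a minimal, single-machine sketch of an rsync-style delta under stated assumptions; it is not the HadoopRsync Upload or Download algorithm, it does not use Hadoop or MapReduce, and it omits the rolling checksum that rsync relies on for efficiency. The names `signature`, `delta`, `patch` and the tiny block size are hypothetical choices made for illustration.

```python
import hashlib

BLOCK = 4  # assumption: tiny block size, chosen only to keep the example readable


def signature(old: bytes, block: int = BLOCK) -> dict:
    """Map the hash of each full fixed-size block of the old file to its block index."""
    sig = {}
    for i in range(0, len(old) - block + 1, block):
        sig.setdefault(hashlib.md5(old[i:i + block]).hexdigest(), i // block)
    return sig


def delta(new: bytes, sig: dict, block: int = BLOCK) -> list:
    """Encode the new file as ('copy', block_index) and ('data', literal_bytes) ops."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        # Only full-size windows can match a signed block of the old file.
        idx = sig.get(hashlib.md5(window).hexdigest()) if len(window) == block else None
        if idx is not None:
            if literal:  # flush pending literal bytes before the copy op
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", idx))
            pos += block
        else:
            literal.append(new[pos])  # no match: this byte must be sent literally
            pos += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file plus the delta ops."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += old[arg * block:(arg + 1) * block]
        else:
            out += arg
    return bytes(out)
```

In this sketch, the side holding the old version would compute `signature`, the side holding the new version would compute `delta` against it, and only the resulting ops (block references plus the changed bytes) would need to cross the network before `patch` reconstructs the new version.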
Keywords :
cloud computing; parallel processing; public domain software; storage management; HadoopRsync download algorithm; HadoopRsync upload algorithm; cloud storage; distributed processing; file hosting services; massive-parallelization capability; naive file synchronization; open-source framework; rsync utility; Bandwidth; Cloud computing; Clustering algorithms; Computers; Encoding; Servers; Synchronization; Cloud storage; Hadoop; Rsync;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud and Service Computing (CSC), 2011 International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4577-1635-5
Electronic_ISBN :
978-1-4577-1636-2
Type :
conf
DOI :
10.1109/CSC.2011.6138515
Filename :
6138515