DocumentCode :
3459783
Title :
File Deduplication with Cloud Storage File System
Author :
Chan-I Ku ; Guo-Heng Luo ; Che-Pin Chang ; Shyan-Ming Yuan
Author_Institution :
Degree Program of ECE & CS Colleges, Nat. Chiao Tung Univ., Hsinchu, Taiwan
fYear :
2013
fDate :
3-5 Dec. 2013
Firstpage :
280
Lastpage :
287
Abstract :
The Hadoop Distributed File System (HDFS) addresses the problem of storing very large data sets, but it provides no mechanism for handling duplicate files. In this study, a middle-layer file system in the HBASE virtual architecture is used to perform file deduplication in HDFS. Two architectures are proposed for different reliability requirements: RFD-HDFS (Reliable File Deduplicated HDFS), which permits no errors, and FD-HDFS (File Deduplicated HDFS), which can tolerate very few errors. Beyond the savings in space complexity, the marginal benefits of deduplication are also explored. Suppose a popular video is uploaded to HDFS by one million users: with Hadoop replication, it is stored as three million files, which wastes a great deal of disk space, and only cloud-side removal of the duplicates can relieve the load effectively. With deduplication, only three file spaces are occupied, so the utility of duplicate removal reaches 100%. The experimental architecture is a cloud-based documentation system, comparable to a cloud version of EndNote, used to simulate the cluster effect of a massive database when researchers synchronize their data with cloud storage.
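The storage arithmetic in the abstract can be illustrated with a small in-memory sketch. This is not the paper's implementation: the abstract does not say how duplicates are detected, so content hashing with SHA-256 is an assumption here, as are the class name `DedupStore`, the `verify_bytes` flag, and the reading of that flag as a stand-in for the zero-error (RFD-HDFS) versus few-errors (FD-HDFS) distinction. The replication factor of 3 is inferred from the abstract's "three million files" figure.

```python
import hashlib

REPLICATION = 3  # assumed from the abstract: 1,000,000 uploads -> 3,000,000 files


class DedupStore:
    """Toy single-instance store (hypothetical): one blob per distinct content."""

    def __init__(self, verify_bytes=False):
        # verify_bytes=True compares full contents before sharing a blob
        # (an assumed analogue of a zero-error mode); False trusts the
        # hash alone (an assumed analogue of a mode tolerating rare errors).
        self.verify_bytes = verify_bytes
        self.blobs = {}  # storage key -> file bytes
        self.paths = {}  # logical path -> storage key

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        key = digest
        if self.verify_bytes:
            # On a hash collision with different bytes, probe suffixed keys
            # until an identical blob or a free slot is found.
            n = 0
            while key in self.blobs and self.blobs[key] != data:
                n += 1
                key = f"{digest}-{n}"
        self.blobs.setdefault(key, data)  # store the bytes only once
        self.paths[path] = key

    def get(self, path):
        return self.blobs[self.paths[path]]

    def physical_copies(self):
        # Each distinct blob is replicated REPLICATION times.
        return len(self.blobs) * REPLICATION


if __name__ == "__main__":
    users = 1000  # scaled down from the abstract's one million
    store = DedupStore(verify_bytes=True)
    for i in range(users):
        store.put(f"/user{i}/video.mp4", b"popular video bytes")
    print("without deduplication:", users * REPLICATION)  # 3000
    print("with deduplication:", store.physical_copies())  # 3
```

However many users upload the same bytes, the deduplicated count stays at three physical copies, which matches the "only three file spaces" claim in the abstract.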
Keywords :
cloud computing; data handling; storage management; EndNote Cloud; HBASE virtual architecture; Hadoop distributed file system; Hadoop replication; RFD-HDFS; cloud based documentation system; cloud storage file system; cluster effect; disk space; duplicate file removal; duplicate files handling mechanism; file deduplication; huge data storage problem; massive database; middle layer file system; reliable file deduplicated HDFS; repeat removal; requirement reliability; space complexity; Bandwidth; Cloud computing; Computer architecture; File systems; Google; Reliability; Writing; Cloud Computing; Data Deduplication; HDFS; Single instance storage;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
Conference_Location :
Sydney, NSW
Type :
conf
DOI :
10.1109/CSE.2013.52
Filename :
6755230