DocumentCode :
2265373
Title :
Frequency Based Chunking for Data De-Duplication
Author :
Lu, Guanlin ; Jin, Yu ; Du, David H C
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Minnesota, Minneapolis, MN, USA
fYear :
2010
fDate :
17-19 Aug. 2010
Firstpage :
287
Lastpage :
296
Abstract :
A predominant portion of Internet services, such as content delivery networks, news broadcasting, blog sharing, and social networks, is data centric, and a significant amount of new data is generated by these services each day. Efficiently storing and maintaining backups for such data is a challenging task for current data storage systems. Chunking-based deduplication (dedup) methods are widely used to eliminate redundant data and hence reduce the required total storage space. In this paper, we propose a novel Frequency Based Chunking (FBC) algorithm. Unlike the most popular Content-Defined Chunking (CDC) algorithm, which divides the data stream randomly according to the content, FBC explicitly utilizes chunk frequency information in the data stream to enhance the data deduplication gain, especially when the metadata overhead is taken into consideration. The FBC algorithm consists of two components: a statistical chunk frequency estimation algorithm for identifying globally frequent chunks, and a two-stage chunking algorithm which uses these chunk frequencies to obtain a better chunking result. To evaluate the effectiveness of the proposed FBC algorithm, we conducted extensive experiments on heterogeneous datasets. In all experiments, the FBC algorithm consistently outperforms the CDC algorithm, either achieving a better dedup gain or producing far fewer chunks. In particular, our experiments show that FBC produces 2.5 to 4 times fewer chunks than a baseline CDC algorithm while achieving the same Duplicate Elimination Ratio (DER). Another benefit of FBC over CDC is that FBC, with an average chunk size greater than or equal to that of CDC, achieves up to 50% higher DER than the CDC algorithm.
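Illustration (not part of the original record): the following is a minimal Python sketch of the two ideas the abstract contrasts, a baseline content-defined chunking pass driven by a rolling hash, and a frequency-aware second stage that keeps only chunks that appear often enough. The window size, divisor, toy rolling hash, and the Counter-based frequency estimate are illustrative assumptions; the paper itself uses a statistical (Bloom-filter-based) chunk frequency estimator and a more refined two-stage chunking algorithm.

    # Hedged sketch of CDC vs. frequency-aware chunking; parameters and the
    # Counter-based frequency estimator are stand-ins, not the authors' method.
    from collections import Counter
    from typing import List

    WINDOW = 48          # sliding-window length before a cut is allowed (assumed)
    DIVISOR = 4096       # expected average chunk size of ~4 KiB (assumed)
    MAGIC = DIVISOR - 1  # boundary condition: hash % DIVISOR == MAGIC

    def cdc_chunks(data: bytes) -> List[bytes]:
        """Baseline content-defined chunking with a toy rolling hash."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF       # toy hash, not Rabin fingerprinting
            if i - start + 1 >= WINDOW and h % DIVISOR == MAGIC:
                chunks.append(data[start:i + 1])  # cut point chosen by content
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])           # trailing chunk
        return chunks

    def frequency_based_chunks(data: bytes, min_freq: int = 2) -> List[bytes]:
        """Two-stage idea: a fine-grained CDC pass, then merge chunks whose
        estimated frequency falls below min_freq into their neighbours, so
        only globally frequent chunks survive as small dedup units."""
        fine = cdc_chunks(data)
        freq = Counter(fine)                      # stand-in for the statistical estimator
        coarse, buffered = [], b""
        for c in fine:
            if freq[c] >= min_freq:
                if buffered:
                    coarse.append(buffered)       # flush the merged infrequent run
                    buffered = b""
                coarse.append(c)                  # frequent chunk kept as-is
            else:
                buffered += c                     # coalesce infrequent chunks
        if buffered:
            coarse.append(buffered)
        return coarse

    def der(chunks: List[bytes]) -> float:
        """Duplicate Elimination Ratio: logical bytes / unique bytes stored."""
        logical = sum(len(c) for c in chunks)
        unique = sum(len(c) for c in set(chunks))
        return logical / unique if unique else 1.0

Under this sketch, merging infrequent chunks reduces the number of chunks (and hence per-chunk metadata) while frequent chunks remain small enough to be deduplicated, which is the trade-off the abstract quantifies with DER versus chunk count.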
Keywords :
Internet; content management; storage management; Internet service; blogs sharing; chunk frequency estimation algorithm; chunk frequency information; chunking based deduplication; content delivery network; content-defined chunking algorithm; data backup; data centric; data deduplication; data storage system; data stream; duplicate elimination ratio; frequency based chunking algorithm; heterogeneous dataset; metadata overhead; news broadcasting; redundant data; social networks; Algorithm design and analysis; Clustering algorithms; Estimation; Filtering; Filtering algorithms; Internet; Power capacitors; Bloom Filter; Content-defined Chunking; Data Deduplication; Frequency based Chunking; Statistical Chunk Frequency Estimation;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on
Conference_Location :
Miami Beach, FL
ISSN :
1526-7539
Print_ISBN :
978-1-4244-8181-1
Type :
conf
DOI :
10.1109/MASCOTS.2010.37
Filename :
5581583