DocumentCode
688231
Title
Exploiting Fingerprint Prefetching to Improve the Performance of Data Deduplication
Author
Liangshan Song ; Yuhui Deng ; Junjie Xie
Author_Institution
Dept. of Comput. Sci., Jinan Univ., Guangzhou, China
fYear
2013
fDate
13-15 Nov. 2013
Firstpage
849
Lastpage
856
Abstract
Data deduplication has become an important and economical way to remove redundant data segments, thus alleviating the pressure incurred by the large amounts of data that need to be stored. Fingerprints are used to represent and identify identical data blocks when performing data deduplication. However, the number of fingerprints grows as the amount of data increases. Because memory size is limited, the fingerprints have to be stored on disk drives. When a fingerprint lookup cannot be satisfied in memory, disk I/Os are generated to fetch the on-disk fingerprints. This results in small, random I/Os, which significantly degrade the performance of data deduplication. This paper introduces a fingerprint prefetching algorithm that leverages file similarity and data locality. On the one hand, we present a similar-file recognition algorithm to identify similar files, that is, files that differ by some modifications but share a large portion of identical data blocks. On the other hand, the on-disk fingerprints are organized according to the sequence of the data streams, thus preserving data locality to improve the cache hit ratio. The proposed prefetching algorithm requests fingerprints from disk drives and places them in memory before they are actually needed. This significantly improves the cache hit ratio when the fingerprints are actually needed, thus enhancing the performance of data deduplication. Two real data sets that represent typical cloud storage and cloud computing scenarios are collected to evaluate the effectiveness of the proposed approach.
Keywords
cache storage; cloud computing; data handling; disc drives; fingerprint identification; storage management; cache hit ratio; cloud computing scenarios; cloud storage; data deduplication; data locality; data sets; data streams; disk I/O; disk drives; file recognition algorithm; file similarity; fingerprint prefetching algorithm; identical data blocks; limited memory size; on-disk fingerprints; random I/O; redundant data segments; Cloud computing; Data compression; Fingerprint recognition; Indexes; Prefetching; Random access memory; Throughput; deduplication; disk bottleneck; file similarity; fingerprint prefetching; locality;
fLanguage
English
Publisher
ieee
Conference_Titel
High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
Conference_Location
Zhangjiajie
Type
conf
DOI
10.1109/HPCC.and.EUC.2013.122
Filename
6832004