Exploiting Fingerprint Prefetching to Improve the Performance of Data Deduplication

Author

Liangshan Song ; Yuhui Deng ; Junjie Xie

Author_Institution

Dept. of Comput. Sci., Jinan Univ., Guangzhou, China

fYear

2013

fDate

13-15 Nov. 2013

Firstpage

849

Lastpage

856

Abstract

Data deduplication has become an important and economical way to remove redundant data segments, thus alleviating the pressure incurred by the large amounts of data that need to be stored. Fingerprints are used to represent and identify identical data blocks when performing data deduplication. However, the number of fingerprints grows as the amount of data increases. Because memory size is limited, the fingerprints have to be stored on disk drives. When a fingerprint lookup cannot be satisfied in memory, disk I/Os are generated to obtain the on-disk fingerprints. This results in small, random I/Os that significantly degrade the performance of data deduplication. This paper introduces a fingerprint prefetching algorithm that leverages file similarity and data locality. On the one hand, we present a similar-file recognition algorithm to identify similar files, that is, files that are considered to differ by some modifications and to share a large portion of identical data blocks. On the other hand, the on-disk fingerprints are organized according to the sequence of data streams, thus maintaining data locality to improve the cache hit ratio. The proposed prefetching algorithm requests fingerprints from disk drives and places them in memory before they are actually needed. This significantly improves the cache hit ratio when the fingerprints are needed, thus enhancing the performance of data deduplication. Two real data sets that represent typical cloud storage and cloud computing scenarios are collected to evaluate the effectiveness of the proposed approach.
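The locality half of the idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PrefetchingFingerprintCache`, the `window` and `capacity` parameters, the use of SHA-1, and the in-memory list standing in for the on-disk fingerprint store are all assumptions made for illustration, and the similar-file recognition step is not modelled. The sketch only shows how keeping on-disk fingerprints in stream order lets one disk read bring in the fingerprints likely to be needed next.

```python
import hashlib
from collections import OrderedDict


def fingerprint(block: bytes) -> str:
    """Return a SHA-1 digest as the block fingerprint (an assumption of this sketch)."""
    return hashlib.sha1(block).hexdigest()


class PrefetchingFingerprintCache:
    """Hypothetical LRU fingerprint cache that prefetches stream-order neighbours."""

    def __init__(self, capacity: int, window: int):
        self.capacity = capacity    # max fingerprints kept in memory
        self.window = window        # fingerprints loaded per simulated disk read
        self.disk_log = []          # stand-in for the on-disk store, in stream order
        self.disk_pos = {}          # stand-in for the on-disk index: fingerprint -> position
        self.cache = OrderedDict()  # in-memory fingerprints, least recently used first
        self.disk_reads = 0         # simulated disk reads caused by cache misses

    def _admit(self, fp: str) -> None:
        """Put fp in the cache as most recently used, evicting the oldest if full."""
        self.cache[fp] = True
        self.cache.move_to_end(fp)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def is_duplicate(self, fp: str) -> bool:
        """Return True if fp was stored before; record it as new otherwise."""
        if fp in self.cache:                 # cache hit: no disk access
            self.cache.move_to_end(fp)
            return True
        pos = self.disk_pos.get(fp)
        if pos is None:                      # new block: append in stream order
            self.disk_pos[fp] = len(self.disk_log)
            self.disk_log.append(fp)
            self._admit(fp)
            return False
        # Cache miss on a known fingerprint: one disk read that also
        # prefetches the fingerprints that follow it in the stream.
        self.disk_reads += 1
        for neighbour in self.disk_log[pos:pos + self.window]:
            self._admit(neighbour)
        return True


def replay(window: int) -> int:
    """Store 8 blocks, then store the same 8 again; return the disk reads incurred."""
    cache = PrefetchingFingerprintCache(capacity=4, window=window)
    blocks = [f"block-{i}".encode() for i in range(8)]
    for block in blocks:                     # first backup: all blocks are new
        cache.is_duplicate(fingerprint(block))
    for block in blocks:                     # second backup: all blocks are duplicates
        assert cache.is_duplicate(fingerprint(block))
    return cache.disk_reads
```

In this toy setup, calling `replay(window=4)` and `replay(window=1)` compares prefetching a run of four neighbouring fingerprints per miss against fetching a single fingerprint per miss. Because the second backup repeats the order of the first, the larger window turns most lookups into cache hits.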

Keywords

cache storage; cloud computing; data handling; disc drives; fingerprint identification; storage management; cache hit ratio; cloud computing scenarios; cloud storage; data deduplication; data locality; data sets; data streams; disk I/O; disk drives; file recognition algorithm; file similarity; fingerprint prefetching algorithm; identical data blocks; limited memory size; on-disk fingerprints; random I/O; redundant data segments; Cloud computing; Data compression; Fingerprint recognition; Indexes; Prefetching; Random access memory; Throughput; deduplication; disk bottleneck; file similarity; fingerprint prefetching; locality;

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on

Conference_Location

Zhangjiajie

Type

conf

DOI

10.1109/HPCC.and.EUC.2013.122

Filename

6832004