ProSy: A similarity based inline deduplication system for primary storage

Author

Xin Du ; Weizheng Hu ; Qiang Wang ; Fang Wang

Author_Institution

Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology, China

fYear

2015

fDate

6-7 Aug. 2015

Firstpage

195

Lastpage

204

Abstract

Data deduplication can reduce cost and improve throughput in backup and archiving systems. Recently, it has become increasingly popular to apply this technique to primary storage systems, where data is actively used by enterprise business applications. However, state-of-the-art deduplication systems for primary storage mainly provide offline solutions, which require a sufficient time window as well as additional space and energy. The biggest challenge for an inline deduplication solution is achieving acceptable performance in terms of data deduplication ratio, access latency, system throughput, and management overhead. In this paper, we propose a high-accuracy similarity algorithm and, based on it, construct ProSy, a real-time inline deduplication system for primary storage that achieves acceptable overall performance without requiring file layout information. ProSy is more reliable because it uses byte-by-byte comparison instead of strong hash comparison to guarantee data integrity. The main idea behind ProSy is to minimize the size of the comparison set by grouping similar file segments into the same category when performing data deduplication. For each file segment, ProSy searches for common data only within the category to which that segment belongs. An experimental evaluation based on real-world datasets shows that ProSy is practical and achieves satisfactory performance. Compared with a common file system, ProSy achieves more than 60% of the maximum data deduplication ratio, a 27% reduction in latency, about 2.7% CPU utilization, 83% write throughput, and 144% read throughput.
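
The category-based lookup described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity algorithm, so the feature used here (the minimum hash over short byte windows), the names `category_of` and `CategoryStore`, and the sizes `BLOCK` and `SHINGLE` are all assumptions chosen for illustration, not ProSy's actual design. The sketch only shows the two ideas the abstract states: a segment is compared solely against data in its own category, and duplicates are confirmed by byte-by-byte comparison rather than by a strong hash.

```python
import hashlib
from collections import defaultdict

BLOCK = 8    # hypothetical block size in bytes (kept tiny for the demo)
SHINGLE = 4  # hypothetical window length for the similarity feature


def category_of(segment: bytes) -> int:
    """Assumed similarity feature: minimum hash over all SHINGLE-byte windows.

    Segments that share most of their content tend to share this minimum,
    so they tend to land in the same category. This stands in for the
    paper's similarity algorithm, which the abstract does not describe.
    """
    if len(segment) < SHINGLE:
        windows = [segment]
    else:
        windows = [segment[i:i + SHINGLE]
                   for i in range(len(segment) - SHINGLE + 1)]
    return min(
        int.from_bytes(hashlib.blake2b(w, digest_size=8).digest(), "big")
        for w in windows
    )


class CategoryStore:
    """Toy inline deduplicating store that searches one category per segment."""

    def __init__(self):
        self.blocks = defaultdict(list)  # category -> unique stored blocks
        self.stored_bytes = 0            # bytes physically kept

    def write(self, segment: bytes):
        cat = category_of(segment)
        bucket = self.blocks[cat]        # the only comparison set consulted
        refs = []
        for off in range(0, len(segment), BLOCK):
            block = segment[off:off + BLOCK]
            for idx, existing in enumerate(bucket):
                if existing == block:    # byte-by-byte equality, no hash trust
                    break
            else:
                bucket.append(block)     # no match found: store a new block
                self.stored_bytes += len(block)
                idx = len(bucket) - 1
            refs.append((cat, idx))
        return refs

    def read(self, refs):
        # Reassemble a segment from its (category, index) references.
        return b"".join(self.blocks[c][i] for c, i in refs)
```

Writing the same segment twice through `CategoryStore.write` stores its blocks only once, and `read` rebuilds the original bytes from the returned references. A real system would also need persistent metadata and a bounded comparison cost per category, which this sketch leaves out.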

Keywords

Data structures; File systems; Fingerprint recognition; Layout; Metadata; Servers; Throughput; inline deduplication; primary storage; similarity

fLanguage

English

Publisher

IEEE

Conference_Titel

Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on

Conference_Location

Boston, MA, USA

Type

conf

DOI

10.1109/NAS.2015.7255230

Filename

7255230