Author_Institution :
Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology, China
Abstract :
Data deduplication can reduce cost and enhance throughput in backup and archiving systems. Recently, it has become increasingly popular to apply this technique to primary storage systems, where data is actively used by enterprise business applications. However, state-of-the-art deduplication systems for primary storage mainly provide offline solutions, which require a sufficient time window as well as additional space and energy. The biggest challenge for an inline deduplication solution is achieving acceptable performance in terms of data deduplication ratio, access latency, system throughput and management overhead. In this paper, we propose a high-accuracy similarity algorithm and, based on it, construct ProSy, a real-time inline deduplication system for primary storage that achieves acceptable overall performance without requiring file layout information. ProSy is more reliable because it uses byte-by-byte comparison instead of strong hash comparison to guarantee data integrity. The main idea behind ProSy is to minimize the size of the comparison set by grouping similar file segments into the same category when performing data deduplication. For each file segment, ProSy searches for common data only within the category to which the segment belongs. An experimental evaluation based on real-world datasets shows that ProSy is practical and achieves satisfactory performance. Compared with a common file system, ProSy achieves more than 60% of the maximum data deduplication ratio, a 27% reduction in latency, about 2.7% CPU utilization, 83% of the write throughput and 144% of the read throughput.
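The category-based lookup described in the abstract can be illustrated with a minimal sketch. This is not the ProSy implementation: the abstract does not specify the similarity algorithm, so the `feature` function (a minimum hash over sliding windows), the `CategoryStore` class, the block size and the number of categories are all hypothetical choices made for illustration. The sketch only shows the two ideas the abstract states: a segment is compared against data in its own category only, and duplicates are confirmed by byte-by-byte comparison rather than by trusting a hash.

```python
import hashlib

BLOCK = 4  # assumed block size, kept tiny for illustration


def feature(segment: bytes, window: int = 4) -> int:
    """Hypothetical similarity feature: the minimum hash over all sliding
    windows. Segments sharing much content tend to share this value."""
    if len(segment) < window:
        return int.from_bytes(hashlib.sha1(segment).digest()[:8], "big")
    return min(
        int.from_bytes(hashlib.sha1(segment[i:i + window]).digest()[:8], "big")
        for i in range(len(segment) - window + 1)
    )


class CategoryStore:
    """Toy store that deduplicates blocks only within one category."""

    def __init__(self, num_categories: int = 16):
        self.num_categories = num_categories  # assumed value
        self.categories = {}   # category id -> list of stored blocks
        self.stored_bytes = 0  # bytes physically kept after deduplication

    def write_segment(self, segment: bytes):
        # The feature selects the category, which bounds the comparison set.
        cat = feature(segment) % self.num_categories
        blocks = self.categories.setdefault(cat, [])
        refs = []
        for off in range(0, len(segment), BLOCK):
            block = segment[off:off + BLOCK]
            for idx, existing in enumerate(blocks):
                if existing == block:  # byte-by-byte comparison, no hash trust
                    break
            else:
                # No identical block in this category: store a new copy.
                blocks.append(block)
                idx = len(blocks) - 1
                self.stored_bytes += len(block)
            refs.append((cat, idx))
        return refs

    def read_segment(self, refs):
        # Rebuild the segment from its block references.
        return b"".join(self.categories[c][i] for c, i in refs)


store = CategoryStore()
seg = b"abcdefghabcd"
first = store.write_segment(seg)
second = store.write_segment(seg)  # fully deduplicated on the second write
```

In this sketch the second write of the same segment adds no stored bytes, because every block is found in the segment's category. A real system would presumably index blocks within a category rather than scan them linearly; the linear scan is used here only to keep the byte-by-byte check explicit.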