• DocumentCode
    2205693
  • Title
    ProSy: A similarity based inline deduplication system for primary storage
  • Author
    Xin Du ; Weizheng Hu ; Qiang Wang ; Fang Wang
  • Author_Institution
    Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology, China
  • fYear
    2015
  • fDate
    6-7 Aug. 2015
  • Firstpage
    195
  • Lastpage
    204
  • Abstract
    Data deduplication can reduce cost and enhance throughput in backup and archiving systems. Recently, it has become increasingly popular to apply this technique in primary storage systems, where data is actively used by enterprise business applications. However, state-of-the-art deduplication systems for primary storage mainly provide offline solutions, which require a sufficiently long time window as well as additional space and energy. The biggest challenge for an inline deduplication solution is achieving acceptable performance in terms of data deduplication ratio, access latency, system throughput and management overhead. In this paper, we propose a high-accuracy similarity algorithm and, based on it, construct ProSy, a real-time inline deduplication system for primary storage that achieves acceptable overall performance without requiring file layout information. ProSy is more reliable because it uses byte-by-byte comparison instead of strong hash comparison to guarantee data integrity. The main idea behind ProSy is to minimize the size of the comparison set by grouping similar file segments into the same category when performing data deduplication. For each file segment, ProSy searches for common data only within the category to which the segment belongs. An experimental evaluation based on real-world datasets shows that ProSy is practical and achieves satisfactory performance. Compared with a common file system, ProSy achieves more than 60% of the maximum data deduplication ratio, a 27% reduction in latency, about 2.7% CPU utilization, 83% of the write throughput and 144% of the read throughput.
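    The category idea in the abstract can be illustrated with a minimal Python sketch. It is not ProSy's implementation: the abstract does not specify the similarity algorithm, so the `feature` function (minimum CRC32 over sliding windows), the `BLOCK` and `WINDOW` sizes, and the `CategoryStore` class are all hypothetical stand-ins chosen for illustration. What the sketch does take from the abstract is that each segment is compared only against data in its own category, and that duplicates are confirmed byte by byte rather than by a strong hash.

```python
import zlib
from collections import defaultdict

BLOCK = 8    # hypothetical block size in bytes (assumption, not from the paper)
WINDOW = 4   # hypothetical window size for the stand-in similarity feature


def feature(segment: bytes) -> int:
    """Stand-in similarity feature: minimum CRC32 over all sliding windows.

    Assumption for illustration only; the paper's algorithm is not given here.
    """
    if len(segment) <= WINDOW:
        return zlib.crc32(segment)
    return min(zlib.crc32(segment[i:i + WINDOW])
               for i in range(len(segment) - WINDOW + 1))


class CategoryStore:
    """Toy store that deduplicates blocks only within a segment's category."""

    def __init__(self):
        self.categories = defaultdict(list)  # feature -> list of unique blocks
        self.stored_bytes = 0                # bytes actually kept

    def write_segment(self, segment: bytes):
        key = feature(segment)
        category = self.categories[key]
        refs = []
        for off in range(0, len(segment), BLOCK):
            block = segment[off:off + BLOCK]
            # Search only this category, comparing bytes directly.
            for idx, stored in enumerate(category):
                if stored == block:
                    break
            else:
                # No identical block in the category: store a new one.
                category.append(block)
                self.stored_bytes += len(block)
                idx = len(category) - 1
            refs.append((key, idx))
        return refs

    def read(self, refs):
        """Rebuild a segment from its (category, index) references."""
        return b"".join(self.categories[k][i] for k, i in refs)
```

    Writing the same segment twice with `write_segment` leaves `stored_bytes` unchanged on the second call, and `read` rebuilds the original bytes from the returned references. Restricting the search to one category keeps the comparison set small, which is the property the abstract relies on to make byte-by-byte comparison affordable inline.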
  • Keywords
    Data structures; File systems; Fingerprint recognition; Layout; Metadata; Servers; Throughput; inline deduplication; primary storage; similarity;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on
  • Conference_Location
    Boston, MA, USA
  • Type
    conf
  • DOI
    10.1109/NAS.2015.7255230
  • Filename
    7255230