• DocumentCode
    3243882
  • Title
    A Novel Optimization Method to Improve De-duplication Storage System Performance
  • Author
    Liu, Chuanyi ; Xue, Yibo ; Ju, Dapeng ; Wang, Dongsheng
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2009
  • fDate
    8-11 Dec. 2009
  • Firstpage
    228
  • Lastpage
    235
  • Abstract
    Data de-duplication has become a commodity component in data-intensive storage systems. Compared with traditional storage paradigms, however, a de-duplication system eliminates duplicate or redundant data at the cost of adding several layers or functional components to the I/O path. These additional components are either CPU-intensive or I/O-intensive and significantly hinder overall system performance. To address these potential bottlenecks, this paper quantitatively analyzes the overhead of each main component introduced by de-duplication and then proposes two performance optimization methods. The first is parallel calculation of content-aware chunk identifiers, which exploits both inter-chunk and intra-chunk parallelism by using a task partition and chunk content distribution algorithm. Experiments demonstrate that it can improve system throughput by up to 150% while making much better use of multiprocessor resources. The second is storage pipelining, which overlaps CPU-bound, I/O-bound, and network communication tasks. With a dedicated five-stage storage pipeline designed for file archival operations, experimental results show that system throughput can increase by up to 25% under our workloads.
  • Keywords
    data compression; parallel programming; storage allocation; chunk content distribution algorithm; data de-duplication; data-intensive storage systems; parallel calculation; performance optimization methods; task partition algorithm; Computer science; Cost function; Cryptography; Information science; Laboratories; Optimization methods; Performance analysis; Pipeline processing; System performance; Throughput; Data De-duplication; Parallel Hash; Performance Optimization; Storage Pipeline;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-5788-5
  • Type
    conf
  • DOI
    10.1109/ICPADS.2009.103
  • Filename
    5395260
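
The abstract's first optimization, parallel calculation of chunk identifiers, can be illustrated with a minimal sketch. This is not the paper's algorithm: the record does not give the task partition or chunk content distribution details, so the sketch assumes fixed-size chunks (rather than content-aware chunking), SHA-1 as the fingerprint function, and a thread pool for inter-chunk parallelism only. The names `CHUNK_SIZE`, `split_chunks`, `fingerprint`, `parallel_fingerprints`, and `deduplicate` are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fixed chunk size; the paper uses content-aware chunking,
# whose parameters are not given in this record.
CHUNK_SIZE = 4096


def split_chunks(data, size=CHUNK_SIZE):
    """Split a byte string into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    """Return the SHA-1 hex digest used here as the chunk identifier."""
    return hashlib.sha1(chunk).hexdigest()


def parallel_fingerprints(chunks, workers=4):
    """Compute chunk identifiers concurrently, preserving chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fingerprint, chunks))


def deduplicate(data, workers=4):
    """Return (ordered chunk ids, store holding one copy of each unique chunk)."""
    chunks = split_chunks(data)
    ids = parallel_fingerprints(chunks, workers)
    store = {}
    for chunk_id, chunk in zip(ids, chunks):
        # Keep only the first copy of each distinct chunk.
        store.setdefault(chunk_id, chunk)
    return ids, store


if __name__ == "__main__":
    sample = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
    ids, store = deduplicate(sample)
    print(len(ids), "chunks,", len(store), "unique")
```

The ordered identifier list is enough to rebuild the original data from the store, which is why `pool.map` (order-preserving) is used instead of collecting results as they complete. Threads are a reasonable choice in this sketch because CPython's `hashlib` releases the global interpreter lock while hashing larger buffers; a process pool would be an alternative under different assumptions.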