  • DocumentCode
    249341
  • Title
    Using Inter-file Similarity to Improve Intra-file Compression
  • Author
    Molfetas, Angelos; Wirth, Andreas; Zobel, Justin
  • Author_Institution
    Dept. of Comput. & Inf. Syst., Univ. of Melbourne, Melbourne, VIC, Australia
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    192
  • Lastpage
    199
  • Abstract
    In storage systems with vast numbers of files, compression techniques should exploit inter-file similarity while allowing near-atomic access to individual files. In differential compression, collections of files are compressed by identifying shared common strings, so some files are represented largely by references to strings in other files. In addition, a file in the collection can be (further) compressed by identifying common strings within the file itself. At the cost of decompression latency, but with a possible gain in compression effectiveness, an LZ-style within-file compressor could resolve these references to other files. To quantify the compression gain, we experiment with a variety of file collections, from emails to source code, and test against multiple measures. If the LZ scheme honors the inter-file references, then there is only minimal improvement. If the LZ algorithm replaces inter-file references with intra-file references, then up to 3% compression improvement is observed for mildly similar files, and over 200% improvement for highly similar files.
  • Keywords
    data compression; source code (software); storage management; LZ algorithm; LZ-style within-file compressor; compression effectiveness; decompression latency; differential compression; e-mails; file collection compression; interfile similarity; intrafile compression; intrafile references; near-atomic access; shared common strings; source code; storage systems; Compression algorithms; Dictionaries; Electronic mail; Encoding; Encyclopedias; Indexes; Measurement; Differential compression; LZ factorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2014 IEEE International Congress on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5056-0
  • Type
    conf
  • DOI
    10.1109/BigData.Congress.2014.35
  • Filename
    6906778