DocumentCode
249341
Title
Using Inter-file Similarity to Improve Intra-file Compression
Author
Molfetas, Angelos ; Wirth, Andreas ; Zobel, Justin
Author_Institution
Dept. of Comput. & Inf. Syst., Univ. of Melbourne, Melbourne, VIC, Australia
fYear
2014
fDate
June 27 2014-July 2 2014
Firstpage
192
Lastpage
199
Abstract
In storage systems with vast numbers of files, compression techniques should exploit inter-file similarity while allowing near-atomic access to individual files. In differential compression, collections of files are compressed by identifying strings that the files share, so some files are represented largely by references to strings in other files. In addition, a file in the collection can be (further) compressed by identifying common strings within the file itself. At the cost of decompression latency, but with a possible gain in compression effectiveness, an LZ-style within-file compressor could resolve these references to other files. To quantify the compression gain, we experiment with a variety of file collections, from emails to source code, and test against multiple measures. If the LZ scheme honors the inter-file references, then there is only minimal improvement. If the LZ algorithm replaces inter-file references with intra-file references, then up to 3% compression improvement is observed for mildly similar files, and over 200% improvement for highly similar files.
Keywords
data compression; source code (software); storage management; LZ algorithm; LZ-style within-file compressor; compression effectiveness; decompression latency; differential compression; e-mails; file collection compression; interfile similarity; intrafile compression; intrafile references; near-atomic access; shared common strings; source code; storage systems; Compression algorithms; Dictionaries; Electronic mail; Encoding; Encyclopedias; Indexes; Measurement; Differential compression; LZ factorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location
Anchorage, AK
Print_ISBN
978-1-4799-5056-0
Type
conf
DOI
10.1109/BigData.Congress.2014.35
Filename
6906778