Title :
RepoZip: A technique for lossless compression of document collections
Author :
Sumanaweera, D.N. ; Doole, F. Fahima ; Pathiraja, D.P. ; Deshapriya, G.G.K. ; Dias, Gihan
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Moratuwa, Moratuwa, Sri Lanka
Abstract :
Many computer systems, especially in corporations, contain large numbers of documents such as letters, reports and presentations. Many such documents exist in several versions. Such data needs to be synchronized with branch offices and mobile devices, often over slow and expensive connections. However, because many documents are stored in an already compressed format, it is difficult to compress them further by exploiting their hidden redundancies. We present a novel approach named RepoZip, which improves the compression achieved by an existing compression algorithm over a document collection by exploiting inter-document meta-data and content-level redundancies. It concentrates on two formats: OOXML documents, which are constructed by archiving a hierarchy of meta-data files, and PDF documents, which include deflated content streams. As a result, RepoZip achieves larger compression gains over OOXML or PDF document collections by exploiting meta-data-level similarities that usually go undetected.
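The premise of the abstract, that redundancy shared between already-deflated containers is invisible to a general-purpose compressor, can be illustrated with a minimal sketch. This is not the RepoZip algorithm; it only compares recompressing each container separately against inflating the members first and compressing the whole collection as one stream. The part names, the XML contents and the helpers `make_docx_like` and `unpack` are hypothetical stand-ins invented for the illustration.

```python
import io
import zipfile
import zlib

# Hypothetical metadata parts, standing in for files that many OOXML
# documents in a collection have in common.
SHARED_PARTS = {
    "[Content_Types].xml": "<Types>" + "".join(
        f'<Override PartName="/word/part{i}.xml" ContentType="application/xml"/>'
        for i in range(40)
    ) + "</Types>",
    "word/styles.xml": "<styles>" + "".join(
        f'<style id="s{i}" font="Calibri" size="{10 + i % 7}"/>'
        for i in range(60)
    ) + "</styles>",
}


def make_docx_like(body: str) -> bytes:
    """Build a small ZIP container with deflated XML members (OOXML-like)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in SHARED_PARTS.items():
            zf.writestr(name, xml)
        zf.writestr("word/document.xml", f"<document>{body}</document>")
    return buf.getvalue()


def unpack(container: bytes) -> bytes:
    """Inflate every member and concatenate member names and raw contents."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        for name in zf.namelist():
            parts.append(name.encode() + b"\0" + zf.read(name))
    return b"\n".join(parts)


# Five "versions" of a document that differ only in a short body.
docs = [make_docx_like(f"Quarterly report, revision {n}") for n in range(5)]

# Baseline: compress each already-deflated container on its own.
per_file = sum(len(zlib.compress(d, 9)) for d in docs)

# Collection-level: inflate first, then compress all raw parts as one stream,
# so the compressor can see the metadata repeated across documents.
solid = len(zlib.compress(b"\n".join(unpack(d) for d in docs), 9))

print(f"per-file recompression: {per_file} bytes")
print(f"inflate-then-compress:  {solid} bytes")
```

The sketch relies on the whole toy collection fitting inside the 32 KB DEFLATE window; larger collections would need a different way to bring similar content together.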
Keywords :
data compression; document handling; meta data; mobile computing; PDF document collections; RepoZip; branch offices; compressed format; compressing OOXML documents; compression algorithm; computer systems; content level redundancies; interdocument metadata; lossless compression technique; mobile devices; undetected metadata; Decision support systems; lossless compression; OOXML; PDF; clusters; generalized suffix tree; meta-data similarity
Conference_Title :
Moratuwa Engineering Research Conference (MERCon), 2015
Conference_Location :
Moratuwa
DOI :
10.1109/MERCon.2015.7112368