Word-Based Statistical Compressors as Natural Language Compression Boosters

Author

Fariña, A.; Navarro, Gonzalo; Paramá, José R.

Author_Institution

Univ. of A Coruña, A Coruña

fYear

2008

fDate

25-27 March 2008

Firstpage

162

Lastpage

171

Abstract

Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With Dense coding as a preprocessing step, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
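
The preprocessing idea described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes End-Tagged Dense Code as the Dense code variant, uses a simplified regex tokenizer that treats separators as ordinary tokens, and ignores the cost of storing the vocabulary. The helper names (`etdc_codeword`, `dense_encode`, `dense_decode`, `boosted_bzip2`) are hypothetical.

```python
import bz2
import re
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Return an End-Tagged Dense Code for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte carries the end tag (high bit set)
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)    # all earlier bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def dense_encode(text: str):
    """Two-pass (semistatic) word-based encoding: (vocabulary, code bytes)."""
    # Assumption: words and separator runs are both tokens, so decoding is lossless.
    tokens = re.findall(r"\w+|\W+", text)
    # First pass: rank tokens by frequency, most frequent first.
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank = {tok: i for i, tok in enumerate(vocab)}
    # Second pass: replace each token by the codeword of its rank.
    return vocab, b"".join(etdc_codeword(rank[t]) for t in tokens)


def dense_decode(vocab, data: bytes) -> str:
    """Invert dense_encode by rebuilding each rank from its bytes."""
    tokens, acc = [], 0
    for b in data:
        if b < 128:               # continuation byte
            acc = acc * 128 + b + 1
        else:                     # tagged byte closes the codeword
            tokens.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return "".join(tokens)


def boosted_bzip2(text: str):
    """Apply bzip2 to the dense-coded byte stream instead of the raw text."""
    vocab, dense = dense_encode(text)
    return vocab, bz2.compress(dense)
```

To compare against plain bzip2 on a given corpus, one could measure `len(bz2.compress(text.encode()))` against the size of the boosted output plus a serialized vocabulary; the sketch makes no claim about the resulting sizes.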

Keywords

data compression; indexing; natural language processing; text analysis; word processing; Dense coding preprocessing; Dense-code-based compression; compression ratio; direct pattern searching; natural language compression booster; natural language text; semistatic word-based byte-oriented compression code; semistatic word-based compression; tagged Huffman coding; text indexing; word-based byte-oriented statistical compressor; word-based statistical compressor; Compressors; Computer science; Databases; Frequency; Huffman coding; Natural languages; Predictive models; Statistics; Text compression; compression boosting

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 2008. DCC 2008

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

978-0-7695-3121-2

Type

conf

DOI

10.1109/DCC.2008.14

Filename

4483294