Title : 
Word-Based Statistical Compressors as Natural Language Compression Boosters
         
        
            Author : 
Fariña, A. ; Navarro, Gonzalo ; Paramá, José R.
         
        
            Author_Institution : 
Univ. of A Coruña, A Coruña
         
        
        
        
        
        
            Abstract : 
Semistatic word-based byte-oriented compression codes are known to be attractive alternatives to compress natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text, but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi, yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With a Dense coding preprocessing, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
         
        
            Keywords : 
data compression; indexing; natural language processing; text analysis; word processing; Dense coding preprocessing; Dense-code-based compression; compression ratio; direct pattern searching; natural language compression booster; natural language text; semistatic word-based byte-oriented compression code; semistatic word-based compression; tagged Huffman coding; text indexing; word-based byte-oriented statistical compressor; word-based statistical compressor; Compressors; Computer science; Data compression; Databases; Frequency; Huffman coding; Indexing; Natural languages; Predictive models; Statistics; Text compression; compression boosting; indexing;
         
        
        
        
            Conference_Title : 
Data Compression Conference, 2008. DCC 2008
         
        
            Conference_Location : 
Snowbird, UT
         
        
        
            Print_ISBN : 
978-0-7695-3121-2
         
        
        
            DOI : 
10.1109/DCC.2008.14