DocumentCode :
3061833
Title :
Improving text compression ratios with the Burrows-Wheeler transform
Author :
Kruse, Holger ; Mukherjee, Amar
Author_Institution :
Dept. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
fYear :
1999
fDate :
29-31 Mar 1999
Firstpage :
536
Abstract :
[Summary form only given]. In this paper we describe several methods that can be used to improve the compression ratio of compression algorithms based an the Burrows-Wheeler transform, as implemented in `bzip2´, when used in combination with English language text files. We first briefly describe the Burrows-Wheeler transform and some of its strengths and weaknesses. We then describe our implementations of enhancements and test results. All of our enhancements are implemented in the form of pre- and postprocessing algorithms that are used in addition to the transform. The use of our algorithms has resulted in improvements in compression ratios of up to 20% over bzip2. The first method involves changing the order of characters in the character set encoding of the input file. We describe different ways to improve the encoding order. A more elaborate method involves minimizing the contextual differences between subsequent codes, using an optimized encoding order obtained by solving a traveling salesman problem. In the second method we replaced frequently occurring bigrams by single unused character codes, in order to better utilize the available character encoding space. In most cases this method did not result in improvements in the compression ratio. The final method is a refinement of the method we presented in “Preprocessing Text to Improve Compression Ratios”, DCC 98. A common English language dictionary is used to preprocess input text and replace English words with encodings that are more easily compressible by bzip2. In this paper we describe optimizations to our original method to further improve the compression ratio. We also show practical ways to use our method on wide-area networks like the Internet, and easy ways to solve the problem of dictionary distribution and caching
Keywords :
Internet; cache storage; data compression; optimisation; text analysis; transform coding; transforms; wide area networks; Burrows-Wheeler transform; English language dictionary; English language text files; Internet; bigrams; bzip; character set encoding; compression algorithms; compression program; consonants; dictionary caching; dictionary distribution; input file; optimized encoding order; single unused character codes; text compression ratios; traveling salesman problem; vowels; wide-area networks; Compression algorithms; Data preprocessing; Dictionaries; Encoding; IP networks; Natural languages; Optimization methods; Space exploration; Testing; Traveling salesman problems;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Compression Conference, 1999. Proceedings. DCC '99
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-0096-X
Type :
conf
DOI :
10.1109/DCC.1999.785693
Filename :
785693
Link To Document :
بازگشت