DocumentCode :
3782265
Title :
Word-based compression methods for large text documents
Author :
J. Dvorsky;J. Pokorny;V. Snasel
Author_Institution :
Dept. of Comput. Sci., Olomouc Palacky Univ., Czechoslovakia
fYear :
1999
Firstpage :
523
Abstract :
Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, uses a single table for both words and non-words (collectively called tokens), and keeps the lexicon in a single data structure that can also serve as a text index. The length of words and non-words is restricted, which improves the compression ratio achieved. Tokens of unlimited length alternate when they are read from the input stream. Restricting token length breaks this alternation, because some tokens are split into several parts of the same type. To preserve the alternation, two special tokens containing no characters are created: the empty word and the empty non-word. The empty word is inserted between two non-words and the empty non-word between two words, so alternation is preserved for all sequences of tokens. The alternation of tokens is an important piece of information: with it, the type of the next token can be predicted. This allows one selected non-word (the so-called victim) to be deleted from the input stream. An algorithm for finding the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in the alternation of words and non-words in the sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as the standard for publishing results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. Moreover, the algorithm enables the evaluation of database queries on the given text. This supports the idea of leaving data in the compressed state as long as possible and decompressing it only when necessary.
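The token handling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the length limit `MAX_LEN`, the `("W" | "N", string)` token representation, the function names, the choice of a single space as the victim, and the rule that a victim is deleted only between two words are all assumptions made for illustration. The abstract does not state the actual limit or the victim-selection algorithm, and the LZW coding phase is omitted entirely.

```python
import re

MAX_LEN = 4  # hypothetical length limit; the abstract does not give the real value


def tokenize(text, max_len=MAX_LEN):
    """Split text into (kind, string) tokens: 'W' for words, 'N' for non-words.

    Runs longer than max_len are cut into pieces, and an empty token of the
    opposite kind is placed between two pieces so that kinds still alternate.
    """
    tokens = []
    for match in re.finditer(r"(\w+)|(\W+)", text):
        kind = "W" if match.group(1) is not None else "N"
        other = "N" if kind == "W" else "W"
        run = match.group()
        for start in range(0, len(run), max_len):
            if start > 0:
                tokens.append((other, ""))  # empty word or empty non-word
            tokens.append((kind, run[start:start + max_len]))
    return tokens


def drop_victim(tokens, victim=" "):
    """Delete the victim non-word wherever it sits between two words.

    Assumption: victims at the very start or end of the stream are kept,
    because the decoder below could not detect their absence.
    """
    kept = []
    for i, (kind, s) in enumerate(tokens):
        between_words = (
            0 < i < len(tokens) - 1
            and tokens[i - 1][0] == "W"
            and tokens[i + 1][0] == "W"
        )
        if kind == "N" and s == victim and between_words:
            continue
        kept.append((kind, s))
    return kept


def restore_victim(tokens, victim=" "):
    """Reinsert the victim wherever two words are adjacent (broken alternation)."""
    restored = []
    for kind, s in tokens:
        if restored and restored[-1][0] == "W" and kind == "W":
            restored.append(("N", victim))
        restored.append((kind, s))
    return restored
```

In this sketch, `restore_victim(drop_victim(tokens))` reproduces the token list returned by `tokenize`, because after the empty tokens are inserted, two adjacent words can only arise where a victim was removed.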
Keywords :
"Compression algorithms","Data compression","Software engineering","Computer science","Spatial databases","Data structures","Testing","Standards publication","Conference management","Image coding"
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 1999. Proceedings. DCC '99
ISSN :
1068-0314
Print_ISBN :
0-7695-0096-X
Type :
conf
DOI :
10.1109/DCC.1999.785680
Filename :
785680