Title :
Study on multi-lingual LZ77 and LZ78 text compression
Author_Institution :
Dept. of Inf. Syst. & Comput. Sci., Nat. Univ. of Singapore, Singapore
Date :
30 Mar-1 Apr 1998
Abstract :
Summary form only given. We studied the effectiveness of multi-lingual character sampling on Lempel-Ziv (LZ) compression algorithms. The LZSS and LZW algorithms were chosen to represent LZ77 and LZ78 compression, respectively. They were modified to adapt to the characteristics of non-English text such as Chinese. Interestingly, the Chinese LZW compressor (CLZW) improves on the original LZW by a larger percentage than the Chinese LZSS compressor (CLZSS) improves on the original LZSS (14.5% vs. 3.7% on average). CLZW also performs better than CLZSS. This can be explained by two factors: the overall dictionary size and the constraints in each of the algorithms. The dictionary size in the LZ78 algorithms, or the sliding window size in the LZ77 algorithms, determines how much previous content the compressor can use to find repeated phrases. Our results show that the Chinese LZ78 compressor can use a larger dictionary much more effectively than the LZ77 family can use its sliding window, without introducing any bad side effects. This also illustrates that previous content is particularly helpful in compressing Chinese text. Because of the linguistic structure of the Chinese language, repeated phrases occur less often in Chinese text than in English text. In other words, within a small, fixed amount of text, it is easier to find repeated phrases in English than in Chinese. Since LZW preserves a large volume of previous content, the Chinese implementation can make good use of it. The different constraints in the two algorithms also contribute to their performance difference. From the analysis, we conclude that the LZ78 family (and thus LZW) is a more suitable Lempel-Ziv family to extend for multi-lingual text compression than the LZ77 family.
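The abstract does not describe how LZW was modified for Chinese, so the following is only an illustrative sketch and not the authors' method. It assumes one plausible reading of "character sampling": the compressor reads whole characters instead of single bytes, and the dictionary is capped at a hypothetical `max_dict_size`. The function names, the shared `alphabet` argument, and the use of UTF-8 for the byte-level case are all assumptions made for the example.

```python
def lzw_encode(symbols, alphabet, max_dict_size=4096):
    """Textbook LZW over an arbitrary symbol sequence (illustrative only)."""
    # Seed the dictionary with every single-symbol phrase.
    table = {(s,): i for i, s in enumerate(alphabet)}
    codes = []
    phrase = ()
    for s in symbols:
        candidate = phrase + (s,)
        if candidate in table:
            phrase = candidate              # keep extending the match
        else:
            codes.append(table[phrase])     # emit the longest known phrase
            if len(table) < max_dict_size:  # dictionary size caps what is remembered
                table[candidate] = len(table)
            phrase = (s,)
    if phrase:
        codes.append(table[phrase])
    return codes


def lzw_decode(codes, alphabet, max_dict_size=4096):
    """Inverse of lzw_encode; must be given the same alphabet and size cap."""
    table = {i: (s,) for i, s in enumerate(alphabet)}
    if not codes:
        return []
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the phrase that is about to be added.
            entry = prev + (prev[0],)
        out.extend(entry)
        if len(table) < max_dict_size:
            table[len(table)] = prev + (entry[0],)
        prev = entry
    return out


text = "压缩算法压缩算法压缩文本"

# Byte-level sampling: each symbol is one byte (UTF-8 here, as an assumption).
byte_codes = lzw_encode(text.encode("utf-8"), range(256))

# Character-level sampling: each symbol is one whole character.
char_alphabet = sorted(set(text))  # assumed to be known to both sides
char_codes = lzw_encode(text, char_alphabet)

print(len(byte_codes), len(char_codes))
```

Comparing code counts for different `max_dict_size` values on real text is one way to explore how dictionary size affects the number of repeated phrases found. The counts alone are not compression ratios, since code widths and the cost of conveying the alphabet are ignored here.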
Keywords :
data compression; document image processing; image coding; image sampling; Chinese; LZSS algorithm; LZW algorithm; Lempel-Ziv compression algorithms; dictionary size; multi-lingual text compression; multi-lingual character sampling; nonEnglish information; performance; repeated phrases; sliding window size; Algorithm design and analysis; Compression algorithms; Computer science; Dictionaries; Information systems; Natural languages; Sampling methods
Conference_Title :
Data Compression Conference, 1998. DCC '98. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-8186-8406-2
DOI :
10.1109/DCC.1998.672254