Title :
Compressing Chinese text files using an adaptive Huffman coding scheme and a static dictionary of character pairs
Author :
Ong, Ghim Hwee ; Chong, Wing Teck
Author_Institution :
Dept. of Inf. Syst. & Comput. Sci., Nat. Univ. of Singapore, Singapore
Abstract :
The compression method for Chinese text files proposed in this paper is based on adaptive Huffman coding, a single-pass data compression technique. All Chinese text files to be compressed are modeled as containing not only ASCII characters, Chinese ideographic characters and punctuation marks, but also commonly used Chinese character pairs. A static dictionary holds about 3000 of the most frequently occurring character pairs found in general Chinese texts; these pairs define the extension to the standard source alphabet in ideogram-based adaptive Huffman coding. The compression ratio and CPU execution time of the proposed method are evaluated against those of the adaptive byte-oriented Huffman coding scheme, the adaptive ideogram-based Huffman coding scheme, and the adaptive LZW method. The experimental results show that the proposed method, based on adaptive Huffman coding with an extended source alphabet, yields better compression on Chinese text files.
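The extended source alphabet described in the abstract can be illustrated with a minimal sketch of the symbol-splitting step that would precede the adaptive Huffman coder. Everything concrete here is an assumption made for illustration: the function name `tokenize`, the three-entry `PAIR_DICT` (the paper's dictionary holds about 3000 pairs, which the abstract does not list), and the greedy left-to-right matching rule, which the abstract does not specify. The adaptive Huffman coder itself is not shown.

```python
def tokenize(text, pair_dict):
    """Split text into symbols of the extended alphabet.

    A symbol is a two-character pair when that pair is in the static
    dictionary, otherwise a single character (ASCII, ideograph or
    punctuation). Greedy left-to-right matching is an assumption.
    """
    symbols = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in pair_dict:
            symbols.append(pair)   # one symbol covers two characters
            i += 2
        else:
            symbols.append(text[i])
            i += 1
    return symbols


# Hypothetical stand-in for the static dictionary of frequent pairs.
PAIR_DICT = frozenset({"中国", "我们", "电脑"})

sample = "我们的电脑A"
symbols = tokenize(sample, PAIR_DICT)
print(symbols)                     # ['我们', '的', '电脑', 'A']
print(len(sample), len(symbols))   # 6 characters become 4 symbols
```

Each symbol in the resulting stream would then be passed to the adaptive Huffman coder, so a frequent pair costs one codeword instead of two. Because the dictionary is static, encoder and decoder can share it in advance and no dictionary data needs to be transmitted.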
Keywords :
Huffman codes; adaptive codes; character sets; computational complexity; data compression; word processing; CPU execution time; Chinese character pairs; Chinese text files; adaptive Huffman coding; compression ratio; extended source alphabet; single pass data compression; static dictionary; Arithmetic; Computer science; Context modeling; Data compression; Dictionaries; Encoding; Frequency; Huffman coding; Information systems; Natural languages;
Conference_Title :
Proceedings of IEEE Singapore International Conference on Networks, 1993 / International Conference on Information Engineering '93: 'Communications and Networks for the Year 2000'
Print_ISBN :
0-7803-1445-X
DOI :
10.1109/SICON.1993.515699