DocumentCode :
3055459
Title :
Japanese text compression using word-based coding
Author :
Morihara, T. ; Satoh, N. ; Yahagi, H. ; Yoshida, S.
Author_Institution :
Fujitsu Labs. Ltd., Atsugi, Japan
fYear :
1998
fDate :
30 Mar-1 Apr 1998
Firstpage :
564
Abstract :
Summary form only given. Because Japanese characters are encoded in 16 bits, compression methods that sample 8-bit characters handle them poorly. At DCC '97, Satoh et al. (1997) reported that adaptive arithmetic coding with 16-bit character sampling is effective in improving the compression ratio. However, the adaptive method does not work well on the small documents produced in offices by groupware and E-mail. The present paper studies a word-based semi-adaptive compression method for Japanese text, with the aim of obtaining good compression performance across a range of document sizes. The algorithm has two stages. The first stage converts input strings into word-index numbers (intermediate data) corresponding to the longest matching strings in the dictionary. The second stage reduces the redundancy of the intermediate data. We adopted a 16-bit word index, and first-order context 16-bit sampling PPMC2 (16 bit-PPM) for entropy coding in the second stage.
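The first stage described in the abstract (longest-match lookup producing 16-bit word-index numbers) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the toy dictionary, the big-endian packing, and the requirement that every input character is covered by some dictionary entry are all assumptions. The second stage (16 bit-PPM entropy coding) is not shown.

```python
def build_index(words):
    """Assign each dictionary entry a word-index number (hypothetical layout)."""
    if len(words) > 0x10000:
        raise ValueError("a 16-bit index can address at most 65536 entries")
    return {word: i for i, word in enumerate(words)}


def longest_match_encode(text, index):
    """Greedy longest match: at each position, emit the index of the
    longest dictionary entry that starts there."""
    max_len = max(len(word) for word in index)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in index:
                out.append(index[candidate])
                pos += length
                break
        else:
            # Assumption: the abstract does not say how unmatched input is handled.
            raise KeyError(f"no dictionary entry covers {text[pos]!r}")
    return out


def pack_16bit(indices):
    """Serialize indices as 2-byte big-endian values (the intermediate data)."""
    return b"".join(i.to_bytes(2, "big") for i in indices)


# Toy dictionary, assumed for illustration only.
words = ["日", "本", "語", "日本", "日本語", "の", "圧縮"]
index = build_index(words)
indices = longest_match_encode("日本語の圧縮", index)
intermediate = pack_16bit(indices)
```

With this toy dictionary, the six-character input is covered by three entries, so the intermediate data is three 16-bit indices; a second-stage entropy coder would then take that index stream as its input.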
Keywords :
adaptive codes; data compression; entropy codes; 16 bit; 16-bit word-index; E-mail; Japanese text compression; compression performance; document size; entropy coding; first order context 16-bit sampling PPMC2; first stage; groupware; input strings; intermediate data; office; redundancy; second stage; small sized documents; word-based coding; word-based semi-adaptive compression method; word-index numbers; Arithmetic; Collaborative software; Collaborative work; Computer simulation; Dictionaries; Entropy coding; Laboratories; Natural language processing; Natural languages; Sampling methods;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Compression Conference, 1998. DCC '98. Proceedings
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-8186-8406-2
Type :
conf
DOI :
10.1109/DCC.1998.672306
Filename :
672306