DocumentCode
297455
Title
Entropies of Chinese texts based on three models of Hanyu Pinyin phonetic system
Author
Huang, Shell Ying ; Ong, Ghim Hwee
Author_Institution
Div. of Comput. Technol., Sch. of Appl. Sci., Nanyang Technol. Univ., Singapore
Volume
1
fYear
1993
fDate
6-11 Sep 1993
Firstpage
305
Abstract
Entropy gives a lower bound on the number of bits required to represent the information in the texts of a language. It is a function of the probability distribution of the language units. A set of language units together with their probabilities constitutes a model of the texts; a different set of language units and probabilities provides a different model. This paper reports on a study of the entropies of Chinese texts under three models based on the Chinese phonetic system, Hanyu Pinyin. These models yield higher entropy values than the ideogram-based model. However, transcribing Chinese texts in Hanyu Pinyin is a simple method of Chinese input, and no translation is needed before storage in computer systems. In addition, the coded frequency table in static and semi-adaptive text compression schemes is much smaller than that for ideograms. This is an important advantage for the compression of small to medium-sized text files.
Keywords
computational linguistics; entropy; speech processing; word processing; Chinese texts; Hanyu Pinyin; entropies; frequency table; phonetic system; text compression schemes; Computer science; Entropy; Frequency; Information systems; Natural languages; Probability distribution; Size measurement
fLanguage
English
Publisher
IEEE
Conference_Titel
Networks, 1993. International Conference on Information Engineering '93. 'Communications and Networks for the Year 2000', Proceedings of IEEE Singapore International Conference on
Print_ISBN
0-7803-1445-X
Type
conf
DOI
10.1109/SICON.1993.515776
Filename
515776