• DocumentCode
    297455
  • Title
    Entropies of Chinese texts based on three models of Hanyu Pinyin phonetic system
  • Author
    Huang, Shell Ying ; Ong, Ghim Hwee
  • Author_Institution
    Div. of Comput. Technol., Sch. of Appl. Sci., Nanyang Technol. Univ., Singapore
  • Volume
    1
  • fYear
    1993
  • fDate
    6-11 Sep 1993
  • Firstpage
    305
  • Abstract
    Entropy gives a lower bound on the number of bits required to represent the information in the texts of a language. It is a function of the probability distribution of the language units. A set of language units together with their probabilities constitutes a model of the texts; a different set of language units and probabilities provides a different model. This paper reports on a study of the entropies of Chinese texts under three models based on the Chinese phonetic system, Hanyu Pinyin. These models yield higher entropy values than the ideogram-based model. However, transcription in Hanyu Pinyin is a simple way to input Chinese text, and no translation is needed before storage in computer systems. In addition, the encoded frequency table used in static and semi-adaptive text compression schemes is much smaller than that for ideograms. This is an important advantage for compression of small to medium-sized text files.
  • Keywords
    computational linguistics; entropy; speech processing; word processing; Chinese texts; Hanyu Pinyin; entropies; frequency table; phonetic system; text compression schemes; Computer science; Entropy; Frequency; Information systems; Natural languages; Probability distribution; Size measurement
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Networks, 1993. International Conference on Information Engineering '93. 'Communications and Networks for the Year 2000', Proceedings of IEEE Singapore International Conference on
  • Print_ISBN
    0-7803-1445-X
  • Type
    conf
  • DOI
    10.1109/SICON.1993.515776
  • Filename
    515776
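
The abstract describes entropy as a function of the probability distribution of the language units, with each choice of units giving a different model. A minimal sketch of that idea follows, assuming a zeroth-order model in which probabilities are estimated from unit counts. The function name `entropy_bits` and the toy syllable list are illustrative assumptions and are not taken from the paper, whose three Hanyu Pinyin models are not specified in this record.

```python
import math
from collections import Counter


def entropy_bits(units):
    """Zeroth-order entropy, in bits per unit, of a sequence of language units.

    Probabilities are estimated as relative frequencies of the units.
    (Hypothetical helper for illustration; not the paper's implementation.)
    """
    counts = Counter(units)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data (an assumption, not from the paper): a short Pinyin syllable stream.
syllables = ["ni", "hao", "ni", "men", "hao", "ni"]

# Two different "models" of the same text: whole syllables vs. single letters.
letters = list("".join(syllables))

print("bits per syllable:", entropy_bits(syllables))
print("bits per letter:  ", entropy_bits(letters))
```

Changing the unit (syllable versus letter here) changes both the entropy per unit and the size of the frequency table that a static or semi-adaptive compressor would need to store, which is the trade-off the abstract points to.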