  • DocumentCode
    2037526
  • Title
    A statistical approach for resolving problematical word boundaries in Chinese lexicography
  • Author
    Kwong, Oi Yee ; Tsou, Benjamin K.
  • Author_Institution
    Language Inf. Sci. Res. Centre, City Univ. of Hong Kong, Kowloon, China
  • Volume
    4
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    2199
  • Abstract
    Word segmentation is an important topic in Chinese language processing. Although state-of-the-art segmentation algorithms demonstrate that more than 90% accuracy could possibly be achieved, there remains the subtle question of what constitutes a Chinese word. In this paper, we focus on two-character word strings which often raise doubts even for lexicographers as to whether the two characters should be segmented or kept as one word. We experiment with the feasibility of modelling human judgement on such problematical word boundaries by corpus-based mutual information. Preliminary results show that the strength of correlation between the two measures might be lexically as well as structurally dependent, and mutual information only partially models human judgement on problematic Chinese word boundaries.
  • Keywords
    computational linguistics; Chinese lexicography; corpus-based mutual information; human judgement modelling; problematical word boundaries; statistical approach; two-character word strings; word segmentation; Art; Cities and towns; Cultural differences; Humans; Marine animals; Mutual information; Natural language processing; Natural languages; Sun; Writing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2001 IEEE International Conference on Systems, Man, and Cybernetics
  • Conference_Location
    Tucson, AZ
  • ISSN
    1062-922X
  • Print_ISBN
    0-7803-7087-2
  • Type
    conf
  • DOI
    10.1109/ICSMC.2001.972882
  • Filename
    972882