• DocumentCode
    2259706
  • Title

    A statistical method for Uyghur tokenization

  • Author

    Aisha, Batuer ; Sun, Maosong

  • Author_Institution
    Dept. of Comput. Sci. & Tech., Tsinghua Univ., Beijing, China
  • fYear
    2009
  • fDate
    24-27 Sept. 2009
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    Tokenization is very important for Uyghur language processing. Tokenization of Uyghur, an agglutinative language, is quite different from that of other languages such as Chinese and English. In this paper we propose a two-step statistical tokenization method for Uyghur. Two related factors, the feature template scheme and the manually tokenized corpora, are also discussed. Preliminary experimental results demonstrate that the proposed method is effective: the F-measure of tokenization reaches 88.9% in the open test.
  • Keywords
    natural language processing; statistical analysis; Uyghur language processing; Uyghur tokenization; agglutinative language; feature template scheme; manually tokenized corpora; statistical method; Automata; Automatic frequency control; Morphology; Natural languages; Shape; Statistical analysis; Testing; MEM; Morpheme; Uyghur; Uyghur Suffix; Uyghur letter; Xinjiang
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
  • Conference_Location
    Dalian
  • Print_ISBN
    978-1-4244-4538-7
  • Electronic_ISBN
    978-1-4244-4540-0
  • Type

    conf

  • DOI
    10.1109/NLPKE.2009.5313764
  • Filename
    5313764