DocumentCode
2259706
Title
A statistical method for Uyghur tokenization
Author
Aisha, Batuer ; Sun, Maosong
Author_Institution
Dept. of Comput. Sci. & Tech., Tsinghua Univ., Beijing, China
fYear
2009
fDate
24-27 Sept. 2009
Firstpage
1
Lastpage
5
Abstract
Tokenization is very important for Uyghur language processing. Tokenization of Uyghur, an agglutinative language, is quite different from that of other languages such as Chinese and English. In this paper we propose a two-step statistical tokenization method for Uyghur. Two related factors, the feature template scheme and the manually tokenized corpora, are also discussed. The preliminary experimental results demonstrate that the proposed method is effective: the F-measure of tokenization reaches 88.9% in the open test.
Keywords
natural language processing; statistical analysis; Uyghur language processing; Uyghur tokenization; agglutinative language; feature template scheme; manually tokenized corpora; statistical method; Automata; Automatic frequency control; Morphology; Natural languages; Shape; Statistical analysis; Testing; MEM; Morpheme; Uyghur; Uyghur Suffix; Uyghur letter; Xinjiang;
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location
Dalian
Print_ISBN
978-1-4244-4538-7
Electronic_ISBN
978-1-4244-4540-0
Type
conf
DOI
10.1109/NLPKE.2009.5313764
Filename
5313764