Regional Information Center for Science and Technology (RICeST) - A statistical method for Uyghur tokenization

DocumentCode :

2259706

Title :

A statistical method for Uyghur tokenization

Author :

Aisha, Batuer ; Sun, Maosong

Author_Institution :

Dept. of Comput. Sci. & Tech., Tsinghua Univ., Beijing, China

fYear :

2009

fDate :

24-27 Sept. 2009

Firstpage :

Lastpage :

Abstract :

Tokenization is very important for Uyghur language processing. Because Uyghur is an agglutinative language, its tokenization is quite different from that of other languages such as Chinese and English. In this paper we propose a two-step statistical tokenization method for Uyghur. Two related factors, the feature template scheme and the manually tokenized corpora, are also discussed. Preliminary experimental results demonstrate that the proposed method is effective: the F-measure of tokenization reaches 88.9% in the open test.

Keywords :

natural language processing; statistical analysis; Uyghur language processing; Uyghur tokenization; agglutinative language; feature template scheme; manually tokenized corpora; statistical method; Automata; Automatic frequency control; Morphology; Natural languages; Shape; Statistical analysis; Testing; MEM; Morpheme; Uyghur; Uyghur Suffix; Uyghur letter; Xinjiang

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on

Conference_Location :

Dalian

Print_ISBN :

978-1-4244-4538-7

Electronic_ISBN :

978-1-4244-4540-0

Type :

conf

DOI :

10.1109/NLPKE.2009.5313764

Filename :

5313764

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2259706