DocumentCode :
1581826
Title :
Improving Thai word segmentation with Named Entity Recognition
Author :
Tepdang, Sayan ; Haruechaiyasak, Choochart ; Kongkachandra, Rachada
Author_Institution :
Dept. of Comput. Sci., Thammasat Univ., Bangkok, Thailand
fYear :
2010
Firstpage :
940
Lastpage :
945
Abstract :
Segmenting words in Thai is difficult because the language has no explicit word-boundary markers such as the spaces, periods, and other punctuation used in English. Several previous studies used a dictionary as the main resource. However, two problems remain: ambiguous words and unknown words. Unknown words fall into two groups: newly defined words and named entities. This paper presents an approach for improving Thai word segmentation by integrating Named Entity Recognition (NER) into the segmentation process. The Conditional Random Fields (CRFs) algorithm is applied to train models that recognize Thai named entities. The prefixes and suffixes of Thai named entities are selected as the main features for learning the models. Performance is evaluated on BEST2009, the standard Thai word segmentation corpus, which consists of 5 million words. Word-level n-grams of several sizes (three, five, and seven) are used to construct the Thai NER models. The experimental results show that the 7-gram NER model performs best. Integrating the proposed NER model into the Thai word segmenter TLex (Thai Lexeme Analyzer) improves the F1-measure from 92.39% to 93.96%.
Keywords :
natural language processing; word processing; Thai language; Thai word segmentation; conditional random field; named entity recognition; word level grams; Accuracy; Biological system modeling; Dictionaries; Encyclopedias; Internet; Merging; Training;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Communications and Information Technologies (ISCIT), 2010 International Symposium on
Conference_Location :
Tokyo
Print_ISBN :
978-1-4244-7007-5
Electronic_ISBN :
978-1-4244-7009-9
Type :
conf
DOI :
10.1109/ISCIT.2010.5665124
Filename :
5665124