DocumentCode
2500962
Title
Thai personal named entity extraction without using word segmentation or POS tagging
Author
Sutheebanjard, P. ; Premchaiswadi, W.
Author_Institution
Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
fYear
2009
fDate
20-22 Oct. 2009
Firstpage
221
Lastpage
226
Abstract
Named entity (NE) extraction for the Thai language is a difficult and time-consuming task because Thai sentences are written as a continuous stream of characters, with no delimiters (blank spaces) to mark word boundaries. Currently, most named entity extraction methods for Thai rely on word segmentation and part-of-speech (POS) tagging, so the accuracy of named entity extraction is largely determined by the performance of those processes. At present, suitable methods for identifying word boundaries in Thai sentences are still lacking. This paper therefore proposes a method for extracting Thai personal named entities without using word segmentation or POS tagging. The proposed method consists of three steps. First, pre-processing removes non-alphabetic characters such as parentheses and numerals. Next, personal named entities are extracted using the contextual environment, front and rear, of the personal name. Finally, in post-processing, a simple rule base is employed to identify personal names. A training corpus of 900 political news articles and a test corpus of 100 political, 100 financial, and 100 sports news articles were used in the experiments. The results showed F-measures of 91.442% in the political domain and 91.720% in the financial domain, which are nearly the same. Because the proposed scheme uses neither word segmentation nor POS tagging, it can significantly reduce the effort and time needed to build the training corpus.
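The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the character set treated as non-alphabetic, the example context strings ("นาย" as a front context, "กล่าว" as a rear context), and the length-based rule are all assumptions made for illustration. The abstract does not say how the contexts are obtained from the training corpus, so here they are simply passed in as lists.

```python
import re

# Assumption: "non-alphabetic" covers ASCII digits, Thai digits (U+0E50-U+0E59),
# and brackets. The paper may use a different character set.
NON_ALPHA = re.compile(r"[0-9\u0E50-\u0E59()\[\]{}]")


def preprocess(text):
    """Step 1: remove digits and brackets; every other character is kept."""
    return NON_ALPHA.sub("", text)


def extract_candidates(text, front_contexts, rear_contexts):
    """Step 2: return the substrings lying between a front and a rear context.

    The context lists are hypothetical inputs; no word segmentation or
    POS tagging is applied to the text.
    """
    front = "|".join(re.escape(c) for c in front_contexts)
    rear = "|".join(re.escape(c) for c in rear_contexts)
    # Lazy match: stop at the first rear context that follows a front context.
    pattern = re.compile(rf"(?:{front})(.+?)(?={rear})")
    return [m.group(1).strip() for m in pattern.finditer(text)]


def postprocess(candidates, max_len=40):
    """Step 3: toy rule base (an assumption): drop empty or overlong candidates."""
    return [c for c in candidates if 0 < len(c) <= max_len]


if __name__ == "__main__":
    sentence = "นายสมชาย ใจดี (45) กล่าวว่า"
    cleaned = preprocess(sentence)
    names = postprocess(extract_candidates(cleaned, ["นาย"], ["กล่าว"]))
    print(names)
```

Working on the raw character stream is what lets the sketch skip segmentation entirely: a candidate name is whatever text falls between a known front context and the nearest rear context, and the rule base then filters implausible candidates.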
Keywords
information retrieval; learning (artificial intelligence); natural language processing; text analysis; POS tagging; Thai personal named entity extraction; blank space; contextual environment; financial news article; machine learning; part of speech; political news article; rule base; sport news article; text analysis; word boundary identification; word segmentation; Data mining; Entropy; Feature extraction; Guidelines; Information retrieval; Natural language processing; Natural languages; Tagging; Testing; Text recognition;
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing, 2009. SNLP '09. Eighth International Symposium on
Conference_Location
Bangkok
Print_ISBN
978-1-4244-4138-9
Electronic_ISBN
978-1-4244-4139-6
Type
conf
DOI
10.1109/SNLP.2009.5340914
Filename
5340914