• DocumentCode
    2500962
  • Title
    Thai personal named entity extraction without using word segmentation or POS tagging
  • Author
    Sutheebanjard, P.; Premchaiswadi, W.
  • Author_Institution
    Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
  • fYear
    2009
  • fDate
    20-22 Oct. 2009
  • Firstpage
    221
  • Lastpage
    226
  • Abstract
    Named entity (NE) extraction for the Thai language is a difficult and time-consuming task because Thai sentences are written as a continuous stream of characters, with no delimiters (blank spaces) to mark word boundaries. Currently, most named entity extraction methods for Thai rely on word segmentation and part-of-speech (POS) tagging, so the accuracy of named entity extraction is mostly affected by how well those processes perform. At present, there is still no suitable method for identifying word boundaries in Thai sentences. This paper therefore proposes a method for extracting Thai personal named entities without using word segmentation or POS tagging. The proposed method consists of three steps. First, pre-processing removes non-alphabetic characters such as parentheses and numerals. Then, personal named entities are extracted using the contextual environment (front and rear) of the personal name. Finally, in post-processing, a simple rule base is employed to identify personal names. A training corpus of 900 political news articles and a test corpus of 100 political, 100 financial, and 100 sports news articles were used in the experiments. The results show that the F-measures in the political and financial domains are 91.442% and 91.720% respectively, which are nearly the same. Moreover, because the proposed scheme uses neither word segmentation nor POS tagging, it can significantly reduce the effort and time needed to build the training corpus.
  • Keywords
    information retrieval; learning (artificial intelligence); natural language processing; text analysis; POS tagging; Thai personal named entity extraction; blank space; contextual environment; financial news article; machine learning; part of speech; political news article; rule base; sport news article; text analysis; word boundary identification; word segmentation; Data mining; Entropy; Feature extraction; Guidelines; Information retrieval; Natural language processing; Natural languages; Tagging; Testing; Text recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing, 2009. SNLP '09. Eighth International Symposium on
  • Conference_Location
    Bangkok
  • Print_ISBN
    978-1-4244-4138-9
  • Electronic_ISBN
    978-1-4244-4139-6
  • Type
    conf
  • DOI
    10.1109/SNLP.2009.5340914
  • Filename
    5340914
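
The three steps named in the abstract (pre-processing, context-based extraction, rule-based post-processing) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cue lists `FRONT_CONTEXTS` and `REAR_CONTEXTS`, the `MAX_NAME_LENGTH` cut-off, and all function names are hypothetical, and the paper derives its contexts from a training corpus rather than from a hand-written list. The sketch only shows how a name might be located from its surrounding characters without segmenting the sentence into words.

```python
import re

# Hypothetical cue lists, for illustration only. Longer cues come first so
# that "นางสาว" (Miss) is tried before its prefix "นาง" (Mrs.).
FRONT_CONTEXTS = ["นางสาว", "นาย", "นาง"]
REAR_CONTEXTS = ["กล่าวว่า", "เปิดเผยว่า"]  # "said that", "revealed that"
MAX_NAME_LENGTH = 40  # assumed cut-off, not a value from the paper


def preprocess(text):
    """Step 1: drop numerals (Arabic and Thai) and bracket characters."""
    return re.sub(r"[0-9๐-๙()\[\]{}]", "", text)


def extract_candidates(text):
    """Step 2: take the shortest span between a front cue and a rear cue."""
    front = "|".join(re.escape(cue) for cue in FRONT_CONTEXTS)
    rear = "|".join(re.escape(cue) for cue in REAR_CONTEXTS)
    pattern = re.compile("(?:%s)(.+?)(?:%s)" % (front, rear))
    return [match.group(1) for match in pattern.finditer(text)]


def postprocess(candidates):
    """Step 3: toy rule base that normalises spacing and rejects long spans."""
    names = []
    for candidate in candidates:
        name = " ".join(candidate.split())
        if name and len(name) <= MAX_NAME_LENGTH:
            names.append(name)
    return names


def extract_person_names(text):
    """Run the three steps in order on raw, unsegmented text."""
    return postprocess(extract_candidates(preprocess(text)))
```

Calling `extract_person_names` on a sentence that contains one of the assumed front cues followed later by one of the assumed rear cues returns the text between them as a candidate name; a sentence with no cue yields an empty list. A realistic rule base would need many more cues and rejection rules than this toy version.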