  • DocumentCode
    475308
  • Title
    A corpus-based approach for keyword identification using supervised learning techniques

  • Author
    TeCho, Jakkrit ; Nattee, Cholwich ; Theeramunkong, Thanaruk

  • Author_Institution
    Sch. of Inf. & Comput. Technol., Thammasat Univ., Pathumthani
  • Volume
    1
  • fYear
    2008
  • fDate
    14-17 May 2008
  • Firstpage
    33
  • Lastpage
    36
  • Abstract
    This paper presents a corpus-based approach for extracting keywords from text written in a language that has no explicit word boundaries. Based on the concept of the Thai character cluster (TCC), a Thai running text is first segmented into a sequence of inseparable units, called TCCs. To handle large-scale text, a sorted sistring (suffix array) is used to calculate a number of statistics for each TCC. Using these statistics, three alternative supervised machine learning techniques (naive Bayes, centroid-based, and k-NN) are applied to learn classifiers for keyword identification. The method is evaluated on medical text extracted from the Web. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
  • Keywords
    Bayes methods; learning (artificial intelligence); text analysis; word processing; Thai character cluster; Thai running text; keyword identification; medical text; naive Bayes; supervised learning techniques; supervised machine learning techniques; word boundary; Data mining; Dictionaries; Machine learning; Natural languages; Search engines; Statistics; Supervised learning; Text categorization; Web pages; World Wide Web;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
  • Conference_Location
    Krabi
  • Print_ISBN
    978-1-4244-2101-5
  • Electronic_ISBN
    978-1-4244-2102-2
  • Type
    conf
  • DOI
    10.1109/ECTICON.2008.4600366
  • Filename
    4600366
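
The abstract mentions using a sorted sistring (suffix array) to gather statistics over TCC units. As a rough illustration only, and not the authors' implementation, the sketch below builds a suffix array over a sequence of pre-segmented units and counts how often a unit sequence occurs. The function names `build_suffix_array` and `count_occurrences` are hypothetical, TCC segmentation itself is assumed to have been done already, and single Latin letters stand in for TCCs.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(units):
    """Return the start positions of all suffixes of `units`, sorted lexicographically."""
    # Naive construction: fine for a sketch, too slow for a genuinely large corpus.
    return sorted(range(len(units)), key=lambda i: units[i:])


def count_occurrences(units, suffix_array, pattern):
    """Count occurrences of `pattern` (a list of units) using the sorted suffixes."""
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them in sorted order,
    # so all suffixes starting with `pattern` form one contiguous run.
    # (Materialising the prefixes is O(n) per query; done here for clarity.)
    prefixes = [tuple(units[i:i + m]) for i in suffix_array]
    target = tuple(pattern)
    return bisect_right(prefixes, target) - bisect_left(prefixes, target)


# Stand-in for a TCC sequence (assumption: one letter = one unit).
units = list("abracadabra")
sa = build_suffix_array(units)
freq = count_occurrences(units, sa, list("abra"))
print(freq)
```

Frequencies obtained this way are one example of the per-unit statistics that could be passed as features to a classifier such as k-NN; which statistics the paper actually uses is not stated in this record.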