A corpus-based approach for keyword identification using supervised learning techniques

Author

TeCho, Jakkrit ; Nattee, Cholwich ; Theeramunkong, Thanaruk

Author_Institution

Sch. of Inf. & Comput. Technol., Thammasat Univ., Pathumthani

Volume

1

fYear

2008

fDate

14-17 May 2008

Firstpage

33

Lastpage

36

Abstract

This paper presents a corpus-based approach for extracting keywords from text written in a language that has no explicit word boundaries. Based on the concept of the Thai character cluster (TCC), a Thai running text is first segmented into a sequence of inseparable units, called TCCs. To handle large-scale text, a sorted sistring (or suffix array) is used to compute several statistics for each TCC. Using these statistics, we apply three alternative supervised machine learning techniques (naive Bayes, centroid-based, and k-NN) to learn classifiers for keyword identification. The method is evaluated on medical text extracted from the Web. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
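
As an illustration of the suffix-array step mentioned in the abstract, the following is a minimal sketch of counting how often a sequence of units occurs in a text. It is not the authors' implementation: the function names `build_suffix_array` and `count_occurrences` are hypothetical, the TCC segmentation rules are not given here, and the toy input uses single Latin characters as stand-ins for TCCs. Frequency is only one of the statistics that could be derived this way; the abstract does not list which ones the paper uses.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(units):
    """Return the start indices of all suffixes of `units`, sorted lexicographically.

    Naive construction (sorts full suffixes); fine for a sketch, too slow for a large corpus.
    """
    return sorted(range(len(units)), key=lambda i: units[i:])


def count_occurrences(units, suffix_array, pattern):
    """Count occurrences of `pattern` (a sequence of units) in `units`.

    All suffixes that start with `pattern` are adjacent in the sorted order,
    so the count is the width of that block.
    """
    k = len(pattern)
    # Length-k prefixes of sorted suffixes are themselves in sorted order.
    # Materialising them keeps the sketch short; a real implementation
    # would compare against the text lazily during the binary search.
    prefixes = [tuple(units[i:i + k]) for i in suffix_array]
    target = tuple(pattern)
    return bisect_right(prefixes, target) - bisect_left(prefixes, target)


# Toy example: each character stands in for one TCC (an assumption).
units = list("abracadabra")
sa = build_suffix_array(units)
print(count_occurrences(units, sa, list("abra")))  # → 2
print(count_occurrences(units, sa, list("a")))     # → 5
```

Because matching suffixes form one contiguous block in the sorted order, the frequency of any candidate unit sequence can be read off with two binary searches, without rescanning the text for each candidate.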

Keywords

Bayes methods; learning (artificial intelligence); text analysis; word processing; Thai character cluster; Thai running text; keyword identification; medical text; naive Bayes; supervised learning techniques; supervised machine learning techniques; word boundary; Data mining; Dictionaries; Machine learning; Natural languages; Search engines; Statistics; Supervised learning; Text categorization; Web pages; World Wide Web;

fLanguage

English

Publisher

IEEE

Conference_Titel

Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on

Conference_Location

Krabi

Print_ISBN

978-1-4244-2101-5

Electronic_ISBN

978-1-4244-2102-2

Type

conf

DOI

10.1109/ECTICON.2008.4600366

Filename

4600366