Title :
A corpus-based approach for keyword identification using supervised learning techniques
Author :
TeCho, Jakkrit ; Nattee, Cholwich ; Theeramunkong, Thanaruk
Author_Institution :
Sch. of Inf. & Comput. Technol., Thammasat Univ., Pathumthani
Abstract :
This paper presents a corpus-based approach for extracting keywords from text written in a language that has no explicit word boundaries. Based on the concept of the Thai character cluster (TCC), a Thai running text is first segmented into a sequence of inseparable units, called TCCs. To handle large-scale text, a sorted sistring (or suffix array) is used to calculate a number of statistics for each TCC. Using these statistics, we apply three alternative supervised machine learning techniques (naive Bayes, centroid-based, and k-NN) to learn classifiers for keyword identification. The method is evaluated using medical text extracted from the WWW. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
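The suffix-array step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the text has already been segmented into units, uses Latin letters as stand-ins for TCCs, and computes only one possible statistic (the occurrence count of a unit sequence). The function names `build_suffix_array` and `count_occurrences` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(units):
    """Return the start positions of all suffixes of `units`, sorted lexicographically."""
    return sorted(range(len(units)), key=lambda i: units[i:])


def count_occurrences(units, suffix_array, pattern):
    """Count how often `pattern` (a list of units) occurs in `units`.

    Because the suffixes are sorted, all suffixes starting with `pattern`
    form one contiguous run, which binary search can locate.
    """
    pattern = list(pattern)
    n = len(pattern)
    # Naive for clarity: materialise each suffix truncated to the pattern length.
    # Truncation preserves the sorted order, so bisect is still valid.
    prefixes = [units[i:i + n] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


# Assumption: each letter stands in for one pre-segmented unit (e.g. one TCC).
units = list("abcabcab")
suffix_array = build_suffix_array(units)
ab_count = count_occurrences(units, suffix_array, ["a", "b"])
bc_count = count_occurrences(units, suffix_array, ["b", "c"])
```

Counts obtained this way could serve as one input feature to a classifier such as k-NN; which statistics the paper actually uses is not stated in the abstract.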
Keywords :
Bayes methods; learning (artificial intelligence); text analysis; word processing; Thai character cluster; Thai running text; keyword identification; medical text; naive Bayes; supervised learning techniques; supervised machine learning techniques; word boundary; Data mining; Dictionaries; Machine learning; Natural languages; Search engines; Statistics; Supervised learning; Text categorization; Web pages; World Wide Web;
Conference_Title :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location :
Krabi
Print_ISBN :
978-1-4244-2101-5
Electronic_ISBN :
978-1-4244-2102-2
DOI :
10.1109/ECTICON.2008.4600366