Title :
Improving Thai Academic Web Page Classification Using Inverse Class Frequency and Web Link Information
Author :
Lertnattee, Verayuth ; Theeramunkong, Thanaruk
Author_Institution :
Silpakorn Univ. Sanamchandra, Muang
Abstract :
Automatic text classification for Web collections is a non-trivial task. Because Thai academic Web pages usually present technical articles, they may contain many technical terms in both Thai and English. This paper presents two approaches to the problem of a large number of unique terms in a Web page: 1) term weighting schemes and 2) schemes using Web link information. We propose using inverse class frequency instead of inverse document frequency in centroid-based text categorization. Web link information, which lets users follow a link to another part of a page or to another page, adds useful unique terms for classification. The experimental results show that inverse class frequency is useful on a set of Thai academic Web documents that are categorized by their sources (sites) of information, and that it should be applied to both prototype and query vectors. Moreover, Web link information proves useful when inverse class frequency is also applied.
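The weighting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the formula icf(t) = log(number of classes / number of classes containing t), the raw term-frequency counts, the cosine similarity, and all function names (`train_centroids`, `weight`, `cosine`, `classify`) are assumptions chosen for illustration. The sketch applies the inverse class frequency weight to both the class prototype (centroid) vectors and the query vector, as the abstract recommends.

```python
import math
from collections import Counter, defaultdict


def train_centroids(docs):
    """Build ICF-weighted class prototypes.

    docs: list of (class_label, list_of_tokens) pairs.
    Returns (centroids, icf). The ICF formula used here is an assumption.
    """
    # Sum raw term frequencies over all documents of each class.
    class_tf = defaultdict(Counter)
    for label, tokens in docs:
        class_tf[label].update(tokens)

    # Class frequency: in how many classes does each term occur?
    class_freq = Counter()
    for tf in class_tf.values():
        class_freq.update(tf.keys())

    n_classes = len(class_tf)
    # Terms that occur in every class get weight log(1) = 0.
    icf = {t: math.log(n_classes / cf) for t, cf in class_freq.items()}

    centroids = {label: weight(tf, icf) for label, tf in class_tf.items()}
    return centroids, icf


def weight(tf, icf):
    """Multiply term frequencies by ICF; unseen terms get weight 0."""
    return {t: f * icf.get(t, 0.0) for t, f in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids, icf):
    """Weight the query with the same ICF values, then pick the closest prototype."""
    query = weight(Counter(tokens), icf)
    return max(centroids, key=lambda label: cosine(query, centroids[label]))
```

Under these assumptions, terms taken from Web link information could simply be appended to a page's token list before calling `train_centroids` or `classify`. How the paper actually combines link terms with page terms is not stated in the abstract.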
Keywords :
Internet; Web sites; classification; information use; natural language processing; text analysis; Thai academic Web page classification; Thai academic Web pages; Web collection; Web link information; automatic text classification; centroid-based text categorization; inverse class frequency; inverse document frequency; query vectors; Bayesian methods; Frequency; Information resources; Natural languages; Prototypes; Statistics; Support vector machine classification; Support vector machines; Text categorization; Web pages; Inverse Class Frequency; Text Categorization; Text Classification; Web Link Information;
Conference_Title :
Advanced Information Networking and Applications - Workshops, 2008. AINAW 2008. 22nd International Conference on
Conference_Location :
Okinawa
Print_ISBN :
978-0-7695-3096-3
DOI :
10.1109/WAINA.2008.120