Title :
KETW: Key Terms Extraction and Term Weighting for Newsgroup Message Classification
Author_Institution :
Modern Educ. Technol. Center, Zhejiang Gongshang Univ., Hangzhou, China
Abstract :
Messages in newsgroups are organized hierarchically into several thousand groups, such as soc.religion and alt.atheism. Automatic text classification can help users determine the most relevant group in which to post a message. However, because newsgroup data sets are very large, high time complexity and low classification accuracy are major challenges for automatic message classification. To address these problems, we propose a system named key terms extraction and term weighting (KETW) that automatically classifies new messages for large newsgroup data sets. The system consists of two parts: (1) it exploits the special properties of newsgroup data to extract representative terms for each group and uses these terms for training rather than the whole vocabulary; (2) it gives terms different weights according to their importance. Our experimental results demonstrate that refining the training set in this way reduces storage to 10% while still achieving good performance.
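The two parts described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how key terms are selected or how weights are computed, so everything below is an assumption made for illustration: the scoring heuristic (in-group term frequency times a log inverse group frequency), the function names extract_key_terms and classify, the top_k parameter, and the toy messages are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def extract_key_terms(messages_by_group, top_k=3):
    """Keep only the top_k highest-scoring terms per group.

    Assumed heuristic (not from the paper): score = relative frequency of
    the term inside the group * log(1 + n_groups / groups containing it).
    """
    group_counts = {
        group: Counter(w for msg in msgs for w in msg.lower().split())
        for group, msgs in messages_by_group.items()
    }
    n_groups = len(group_counts)
    # Number of groups in which each term appears at least once.
    groups_containing = Counter()
    for counts in group_counts.values():
        groups_containing.update(counts.keys())

    key_terms = {}
    for group, counts in group_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(1 + n_groups / groups_containing[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        key_terms[group] = dict(top)
    return key_terms

def classify(message, key_terms):
    """Assign the group whose weighted key terms best match the message."""
    words = message.lower().split()
    scores = {
        group: sum(weights.get(w, 0.0) for w in words)
        for group, weights in key_terms.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
toy_messages = {
    "soc.religion": ["faith and prayer in church", "church prayer meeting"],
    "alt.atheism": ["atheism and reason debate", "reason over faith debate"],
}
toy_key_terms = extract_key_terms(toy_messages, top_k=3)
predicted_group = classify("prayer at the church", toy_key_terms)
```

In this sketch only the retained key terms and their weights are stored per group, which is the general idea behind training on a reduced vocabulary; the 10% storage figure reported in the abstract is not something this toy example reproduces.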
Keywords :
information resources; text analysis; KETW; automatic text classification; key terms extraction; newsgroup message classification; term weighting; Data mining; Educational technology; Frequency; Java; Machine learning; Motion pictures; Performance gain; Software engineering; Text categorization; Vocabulary;
Conference_Title :
Software Engineering, 2009. WCSE '09. WRI World Congress on
Conference_Location :
Xiamen
Print_ISBN :
978-0-7695-3570-8
DOI :
10.1109/WCSE.2009.9