Title :
Fine Text Categorization: Using Very Aggressive Feature Selection to Cope with Mass Duplicated Features
Author :
DAI, Liuling ; HU, Jinwu ; Wu, ShiKun
Author_Institution :
Sch. of Comput. Sci., Beijing Inst. of Technol., Beijing
Abstract :
Text categorization is a key issue in text mining. Although there are many studies on this problem, the majority of them focus on classification into rough categories. In this kind of problem, there are obviously distinct features that can differentiate one category from the others. Very few studies have addressed the fine text categorization (FTC) problem, which is characterized by many duplicated features across different categories. In this paper, we first point out that traditional feature selection levels can't be directly used to cope with this problem. To improve performance, we perform very aggressive feature selection (VAFS) by first removing the common features arbitrarily and then selecting features with a modified CHI-square statistic in a very aggressive manner. In the end, only very few features are used to learn the underlying concepts of the categories. Experimental results show that VAFS improves performance notably and that rule-based algorithms are more suitable than vector-based algorithms.
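The two-step selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the modified CHI-square statistic, so the standard 2x2 chi-square is swapped in, and the names `select_features`, `common_ratio`, and `top_k`, the rule for what counts as a "common" feature, and the positive-association filter are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Standard chi-square for a 2x2 term/category table.

    n11: docs in the category containing the term
    n10: docs outside the category containing the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    (The paper's *modified* statistic is not given in the abstract.)
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, common_ratio=1.0, top_k=1):
    """Hypothetical two-step selection: drop common terms, then keep very few.

    docs: list of token lists; labels: category name per document.
    common_ratio: assumed threshold; a term seen in at least this fraction
                  of the categories is treated as a duplicated feature.
    top_k: number of terms kept per category (the "aggressive" part).
    """
    categories = sorted(set(labels))
    cat_size = Counter(labels)
    n_docs = len(docs)

    # Per-term document frequency, broken down by category.
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term][label] += 1

    # Step 1: remove terms shared across (nearly) all categories.
    candidates = [
        term for term, per_cat in df.items()
        if len(per_cat) / len(categories) < common_ratio
    ]

    # Step 2: rank the remaining terms per category and keep only top_k.
    selected = {}
    for cat in categories:
        scored = []
        for term in candidates:
            n11 = df[term][cat]
            n10 = sum(df[term].values()) - n11
            n01 = cat_size[cat] - n11
            n00 = n_docs - n11 - n10 - n01
            # Assumption: keep only terms positively associated with cat.
            if n11 * n00 > n10 * n01:
                scored.append((chi_square(n11, n10, n01, n00), term))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[cat] = [term for _, term in scored[:top_k]]
    return selected


if __name__ == "__main__":
    docs = [
        ["python", "list", "error"],
        ["python", "dict", "error"],
        ["java", "list", "error"],
        ["java", "map", "error"],
    ]
    labels = ["py", "py", "java", "java"]
    print(select_features(docs, labels, top_k=1))
```

In this toy corpus, "error" and "list" occur in both categories and are discarded in the first step, so only category-specific terms compete in the chi-square ranking.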
Keywords :
data mining; knowledge based systems; statistical analysis; text analysis; aggressive feature selection; fine text categorization; mass duplicated features; rough categories; rule based algorithms; text mining; Automation; Information retrieval; Information technology; Laboratories; Machine learning algorithms; Partial response channels; Support vector machine classification; Support vector machines; Text categorization; Text mining; SVM; feature selection; fine text categorization; kNN; rough set;
Conference_Title :
Intelligent Computation Technology and Automation (ICICTA), 2008 International Conference on
Conference_Location :
Hunan
Print_ISBN :
978-0-7695-3357-5
DOI :
10.1109/ICICTA.2008.90