DocumentCode :
3227919
Title :
Fine Text Categorization: Using Very Aggressive Feature Selection to Cope with Mass Duplicated Features
Author :
DAI, Liuling ; HU, Jinwu ; Wu, ShiKun
Author_Institution :
Sch. of Comput. Sci., Beijing Inst. of Technol., Beijing
Volume :
2
fYear :
2008
fDate :
20-22 Oct. 2008
Firstpage :
984
Lastpage :
988
Abstract :
Text categorization is a key issue in text mining. Although there are many studies on this problem, most of them focus on classification into rough categories, where clearly distinct features differentiate one category from the others. Very few studies have addressed the fine text categorization (FTC) problem, which is characterized by many duplicated features across different categories. In this paper, we first point out that traditional feature selection levels can't be directly used to cope with this problem. To improve performance, we perform very aggressive feature selection (VAFS) by first removing the common features arbitrarily and then selecting features with a modified CHI-square statistic in a very aggressive manner. In the end, only very few features are used to learn the underlying concepts of the categories. Experimental results show that VAFS improves performance notably and that rule-based algorithms are more suitable than vector-based algorithms.
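The two-stage selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modified CHI-square statistic, so the standard 2x2 chi-square is used instead, and "common features" are assumed here to mean terms that occur in every category. The function names `chi_square` and `select_features` and the parameter `k` are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Standard 2x2 chi-square score for one term/category pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return {category: up to k terms} after aggressive filtering.

    docs:   list of token collections, one per document
    labels: category of each document, parallel to docs
    k:      number of terms kept per category (small = aggressive)
    """
    categories = sorted(set(labels))
    n_docs = len(docs)
    cat_size = Counter(labels)
    df = Counter()                 # document frequency per term
    df_cat = defaultdict(Counter)  # document frequency per category and term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[label][term] += 1

    # Stage 1 (assumption): drop terms that appear in every category.
    common = {t for t in df if all(df_cat[c][t] > 0 for c in categories)}
    candidates = set(df) - common

    # Stage 2: keep the k best positively associated terms per category.
    selected = {}
    for cat in categories:
        scored = []
        for term in candidates:
            a = df_cat[cat][term]
            b = df[term] - a
            c = cat_size[cat] - a
            d = n_docs - a - b - c
            if a * d > b * c:  # term is more frequent inside the category
                scored.append((chi_square(a, b, c, d), term))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[cat] = [term for _, term in scored[:k]]
    return selected
```

Choosing a small `k` mirrors the "only very few features" idea; the value that works in practice would have to be tuned on the data at hand.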
Keywords :
data mining; knowledge based systems; statistical analysis; text analysis; aggressive feature selection; fine text categorization; mass duplicated features; rough categories; rule based algorithms; text mining; Automation; Information retrieval; Information technology; Laboratories; Machine learning algorithms; Partial response channels; Support vector machine classification; Support vector machines; Text categorization; Text mining; SVM; feature selection; fine text categorization; kNN; rough set;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Computation Technology and Automation (ICICTA), 2008 International Conference on
Conference_Location :
Hunan
Print_ISBN :
978-0-7695-3357-5
Type :
conf
DOI :
10.1109/ICICTA.2008.90
Filename :
4659910