DocumentCode :
3301649
Title :
Dimension reduction based on categorical fuzzy correlation degree for document categorization
Author :
Qiang Li ; Liang He ; Xin Lin
Author_Institution :
Dept. of Comput. Sci. & Technol., East China Normal Univ., Shanghai, China
fYear :
2013
fDate :
13-15 Dec. 2013
Firstpage :
186
Lastpage :
190
Abstract :
High dimensionality of the feature space is a common problem in document categorization. Most of the features selected by conventional feature selection algorithms such as information gain (IG) are relevant but redundant. This paper proposes a two-step feature selection method. In the first step, redundancy analysis based on the categorical fuzzy correlation degree is applied to the original features to filter out redundant features that have similar categorical term frequency distributions. In the second step, the conventional IG feature selection algorithm is used to select the final feature set for document categorization. Experiments on the well-known Reuters-21578 and 20news-18828 corpora show that the proposed method eliminates redundant features that have a high fuzzy correlation degree with one another and yields a compressed feature space of dramatically reduced dimension. The categorization results on both corpora show that the conventional IG feature selection algorithm achieves better document categorization performance on the compressed feature space, which demonstrates the effectiveness of the proposed method.
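The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for the categorical fuzzy correlation degree, so the sketch substitutes a stand-in: each term's per-category frequency vector is scaled by its peak to form fuzzy memberships, and two terms are compared with a min/max ratio. That measure, the greedy filtering order, the threshold, and all function names (category_tf, fuzzy_similarity, filter_redundant, two_step_select) are assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter, defaultdict


def category_tf(docs, categories):
    """Count how often each term occurs in each category."""
    index = {c: i for i, c in enumerate(categories)}
    tf = defaultdict(lambda: [0] * len(categories))
    for tokens, label in docs:
        for token in tokens:
            tf[token][index[label]] += 1
    return dict(tf)


def membership(vec):
    """Scale a frequency vector by its peak so values lie in [0, 1]."""
    peak = max(vec)  # > 0 because every term occurs at least once
    return [v / peak for v in vec]


def fuzzy_similarity(a, b):
    """ASSUMED stand-in for the fuzzy correlation degree: sum(min) / sum(max)."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den


def filter_redundant(tf, threshold):
    """Step 1 (assumed greedy scheme): keep a term only if it is not too
    similar to any term already kept; frequent terms are visited first."""
    members = {t: membership(v) for t, v in tf.items()}
    kept = []
    for term in sorted(tf, key=lambda t: (-sum(tf[t]), t)):
        if all(fuzzy_similarity(members[term], members[k]) < threshold for k in kept):
            kept.append(term)
    return kept


def entropy(counts):
    """Shannon entropy (base 2) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(term, docs):
    """Standard IG of a term's presence/absence with respect to the labels."""
    present = Counter(label for tokens, label in docs if term in tokens)
    absent = Counter(label for tokens, label in docs if term not in tokens)
    gain = entropy(Counter(label for _, label in docs).values())
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(part.values())
    return gain


def two_step_select(docs, categories, threshold, k):
    """Step 1: redundancy filter. Step 2: top-k by IG on the surviving terms."""
    kept = filter_redundant(category_tf(docs, categories), threshold)
    return sorted(kept, key=lambda t: (-information_gain(t, docs), t))[:k]


# Toy data, invented for illustration only.
docs = [
    (["oil", "crude", "price"], "energy"),
    (["oil", "crude", "barrel"], "energy"),
    (["wheat", "grain", "price"], "grain"),
    (["wheat", "grain", "harvest"], "grain"),
]
selected = two_step_select(docs, ["energy", "grain"], threshold=0.9, k=2)
```

On this toy data, "oil" and "crude" have identical category distributions, so only one of them survives the first step before IG ranks the remaining terms. Real corpora such as Reuters-21578 would need tokenization, a tuned threshold, and the paper's own correlation measure.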
Keywords :
category theory; classification; document handling; feature selection; fuzzy set theory; information filtering; 20news-18828 corpuses; IG feature selection algorithm; Reuters-21578 corpuses; categorical fuzzy correlation degree; categorical term frequency distribution; compressed feature space; dimension reduction; document categorization performance; feature set selection; feature space dimension; feature space high dimensionality; redundancy analysis; redundant features elimination; redundant features filter; two-step feature selection method; Algorithm design and analysis; Classification algorithms; Correlation; Frequency measurement; Fuzzy sets; Redundancy; Text categorization; document categorization; feature selection; fuzzy correlation degree; redundancy; relevance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Granular Computing (GrC), 2013 IEEE International Conference on
Conference_Location :
Beijing
Type :
conf
DOI :
10.1109/GrC.2013.6740405
Filename :
6740405