DocumentCode :
2028961
Title :
A two-stage feature selection method for text categorization
Author :
Meng, Jiana ; Lin, Hongfei
Author_Institution :
Dept. of Comput. Sci. & Eng., Dalian Univ. of Technol., Dalian, China
Volume :
4
fYear :
2010
fDate :
10-12 Aug. 2010
Firstpage :
1492
Lastpage :
1496
Abstract :
Feature selection for text classification is a well-studied problem whose goals are to improve classification effectiveness, computational efficiency, or both. In this paper, we propose a two-stage feature selection algorithm that combines a feature selection method with latent semantic indexing. Traditional word-matching based text categorization systems use the vector space model to represent documents. However, this model needs a high-dimensional space to represent a document and does not take into account the semantic relationships between terms, which can lead to poor classification accuracy. Latent semantic indexing can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. It not only greatly reduces the dimensionality but also discovers important associative relationships between terms. Because constructing a new semantic space is computationally expensive, the algorithm first applies a feature selection method to reduce the term dimensions. It then constructs a new, reduced semantic space between terms using latent semantic indexing. In experiments on spam database categorization, we find that our two-stage feature selection method performs better.
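The two stages described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the first-stage feature selection method, so a chi-square score on term presence is assumed here, and latent semantic indexing is approximated with a truncated SVD from NumPy. The function names `chi2_scores` and `two_stage_reduce` and the toy matrix are hypothetical.

```python
import numpy as np


def chi2_scores(X, y):
    """Chi-square score of each term against a binary class label.

    X is a documents-by-terms count matrix, y holds 0/1 labels.
    Assumption: chi-square stands in for the unspecified stage-one method.
    """
    present = (X > 0).astype(float)      # 1 where a term occurs in a document
    y = np.asarray(y, dtype=float)
    n = float(len(y))
    a = present.T @ y                    # term present, class 1
    b = present.T @ (1.0 - y)            # term present, class 0
    c = y.sum() - a                      # term absent, class 1
    d = (n - y.sum()) - b                # term absent, class 0
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Terms that occur in every document (or in none) get a score of 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def two_stage_reduce(X, y, n_terms, n_concepts):
    """Stage 1: keep the n_terms highest-scoring terms.
    Stage 2: project the reduced matrix onto n_concepts latent dimensions."""
    top = np.argsort(chi2_scores(X, y))[::-1][:n_terms]
    keep = np.sort(top)                  # column indices of the retained terms
    X_sel = X[:, keep].astype(float)

    # Truncated SVD of the reduced matrix gives the LSI concept space.
    U, s, Vt = np.linalg.svd(X_sel, full_matrices=False)
    k = min(n_concepts, len(s))
    docs_k = U[:, :k] * s[:k]            # documents in concept space
    return keep, docs_k, Vt[:k]          # Vt[:k] maps terms to concepts


# Toy data: 4 documents, 5 terms; the last term occurs everywhere.
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 0, 2, 1, 1],
    [0, 0, 1, 2, 1],
])
y = [1, 1, 0, 0]
keep, docs_k, basis = two_stage_reduce(X, y, n_terms=4, n_concepts=2)
print(keep)
print(docs_k)
```

Running the SVD only on the columns that survive stage one is what keeps the second stage cheap; a new document would be mapped into the same space by selecting the `keep` columns and multiplying by `basis.T`.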
Keywords :
document handling; feature extraction; indexing; pattern classification; support vector machines; text analysis; latent semantic indexing; reduced semantic space; spam database categorization; text categorization; two stage feature selection method; vector space model; word matching based text categorization system; Accuracy; Large scale integration; Machine learning; Semantics; feature selection; support vector space
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location :
Yantai, Shandong
Print_ISBN :
978-1-4244-5931-5
Type :
conf
DOI :
10.1109/FSKD.2010.5569324
Filename :
5569324