Title :
A Study for Important Criteria of Feature Selection in Text Categorization
Author_Institution :
Beijing Language & Culture Univ., Beijing, China
Abstract :
A major difficulty in text categorization (TC) is the high dimensionality of the feature space. Feature selection (FS) is an important step for reducing that space. Empirical studies show that good text categorization performance is related to certain feature selection criteria, and that when a method fails to satisfy a criterion, this often indicates that the method is non-optimal. According to our analysis, several factors account for good feature selection performance in text categorization tasks, including favoring common terms, using category information, and using term frequency information. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, but none of them satisfies all of these criteria. In this paper, we present some important criteria for FS in TC. Experimental results indicate that the empirical performance of an FS function is closely related to how well it satisfies these criteria.
Keywords :
text analysis; category information; document frequency thresholding; feature selection criteria; feature space; frequency information; information gain; mutual information; text categorization; Availability; Document handling; Frequency measurement; Gain measurement; Information analysis; Mutual information; Organizing; Performance analysis; Space technology; Text categorization
Conference_Title :
Intelligent Systems and Applications (ISA), 2010 2nd International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-5872-1
Electronic_ISBN :
978-1-4244-5874-5
DOI :
10.1109/IWISA.2010.5473381