Title :
Categorical Document Frequency Based Feature Selection for Text Categorization
Author :
Zhen, Zhilong ; Wang, Haijuan ; Han, Lixin ; Shi, Zhan
Author_Institution :
Coll. of Comput. & Inf., Hohai Univ., Nanjing, China
Abstract :
Effective feature selection methods are essential for improving the accuracy and efficiency of text categorization. Motivated by document frequency, we propose a new filter-based feature selection approach called categorical document frequency. Categorical document frequency describes the distribution of a term over the categories. Mathematically, the variance of this distribution reflects the contribution of the term to categorization. Experiments are carried out on the Reuters-21578 standard text corpus. The results show that the categorization performance of the proposed approach is similar to or better than that of information gain and the chi-square statistic. In addition, the computational cost of this approach is lower than that of information gain and chi-square, so it is also well suited to processing large-scale text data.
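The abstract does not give the exact scoring formula, so the following is only a minimal sketch of one plausible reading: count, for each term, the fraction of documents in each category that contain it, and score the term by the variance of those fractions. The function names, the normalization by category size, and the use of population variance are all assumptions made for illustration, not the authors' definition.

```python
from collections import Counter, defaultdict
from statistics import pvariance


def categorical_df_scores(docs, labels):
    """Score terms by the variance of their per-category document frequency.

    docs   -- list of token lists, one per document
    labels -- list of category labels, aligned with docs
    """
    cat_sizes = Counter(labels)
    # df[term][category] = number of documents in that category containing term
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df[term][label] += 1

    categories = sorted(cat_sizes)
    scores = {}
    for term, per_cat in df.items():
        # Assumption: normalize by category size so large categories
        # do not dominate the variance.
        rates = [per_cat[c] / cat_sizes[c] for c in categories]
        scores[term] = pvariance(rates)
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (hypothetical, not taken from Reuters-21578).
docs = [
    ["oil", "price"],
    ["oil", "export"],
    ["wheat", "price"],
    ["wheat", "crop"],
]
labels = ["crude", "crude", "grain", "grain"]

scores = categorical_df_scores(docs, labels)
top_terms = select_top_k(scores, 2)
```

In this toy example a term such as "price", which appears equally often in both categories, has zero variance and is ranked last, while terms concentrated in one category rank first. A single pass over the documents suffices to build the counts, which is consistent with the abstract's claim of low computational cost.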
Keywords :
Internet; feature extraction; statistical analysis; text analysis; Reuters-21578 standard text corpus; categorical document frequency; categorical document frequency based feature selection method; chi-square statistic; filter-based feature selection approach; information gain; text categorization; Accuracy; Frequency measurement; Information filters; Machine learning; Training; feature selection; filter
Conference_Title :
Information Technology, Computer Engineering and Management Sciences (ICM), 2011 International Conference on
Conference_Location :
Nanjing, Jiangsu
Print_ISBN :
978-1-4577-1419-1
DOI :
10.1109/ICM.2011.365