DocumentCode :
1931855
Title :
Categorical term frequency probability based feature selection for document categorization
Author :
Qiang Li ; Liang He ; Xin Lin
Author_Institution :
Dept. of Comput. Sci. & Technol., East China Normal Univ., Shanghai, China
fYear :
2013
fDate :
15-18 Dec. 2013
Firstpage :
66
Lastpage :
71
Abstract :
Document categorization relies heavily on the categorical distribution of features. Terms that occur unevenly across categories carry strong discriminative information for categorization. We first define CTFP (Categorical Term Frequency Probability), which is used to accurately reflect the categorical characteristics of a term in each category. We then introduce the CTFP_VM (Variance-Mean based on CTFP) feature selection criterion to reveal differences in category distribution. After computing and ranking the variance-mean of each term's CTFP distribution, feature sets are obtained for document categorization. We run document categorization experiments with SVM classifiers on the well-known Reuters-21578 and 20 news-18828 corpora, used as the unbalanced and balanced corpus respectively. The experiments compare the proposed method with conventional feature selection algorithms, and the proposed method yields the best feature set for document categorization. The experimental results also demonstrate that the proposed CTFP-based variance-mean feature selection method not only achieves a better F1-metric for document categorization but also adapts well across corpora.
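The abstract does not give the CTFP or CTFP_VM formulas, so the following Python sketch is only an illustration of the compute-then-rank pipeline it describes, under two loudly stated assumptions: CTFP of a term in a category is taken to be the term's count in that category divided by the category's total token count, and the variance-mean score is taken to be the variance of those values across categories divided by their mean. The function names ctfp_table, variance_mean_score and select_features are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def ctfp_table(docs, labels):
    """Map each term to its per-category relative frequencies.

    ASSUMPTION (not from the paper): CTFP(term, category) is the term's
    count in the category divided by the category's total token count.
    """
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    categories = sorted(counts)
    totals = {c: sum(counts[c].values()) for c in categories}
    vocab = sorted({term for c in categories for term in counts[c]})
    return {t: [counts[c][t] / totals[c] for c in categories] for t in vocab}


def variance_mean_score(probs):
    """ASSUMPTION (not from the paper): variance across categories / mean."""
    mean = sum(probs) / len(probs)
    if mean == 0:
        return 0.0
    variance = sum((p - mean) ** 2 for p in probs) / len(probs)
    return variance / mean


def select_features(docs, labels, k):
    """Rank terms by descending score (ties broken alphabetically), keep k."""
    table = ctfp_table(docs, labels)
    ranked = sorted(table, key=lambda t: (-variance_mean_score(table[t]), t))
    return ranked[:k]


if __name__ == "__main__":
    # Toy corpus: two categories, two tokenized documents each.
    docs = [
        ["ball", "goal", "the"],
        ["goal", "the", "team"],
        ["vote", "law", "the"],
        ["law", "the", "senate"],
    ]
    labels = ["sport", "sport", "politics", "politics"]
    print(select_features(docs, labels, 2))
```

Under these assumptions, a term spread evenly over the categories (such as "the" in the toy corpus) scores zero and ranks last, while terms concentrated in one category rank first. The selected terms would then serve as the feature set passed to a classifier such as an SVM.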
Keywords :
category theory; document handling; feature selection; pattern classification; statistical distributions; CTFP; categorical term frequency probability; category distribution difference; document categorization; Algorithm design and analysis; Classification algorithms; Feature extraction; Measurement; Pattern recognition; Support vector machines; Training; categorical distribution; term frequency; variance mean
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Soft Computing and Pattern Recognition (SoCPaR), 2013 International Conference of
Conference_Location :
Hanoi
Print_ISBN :
978-1-4799-3399-0
Type :
conf
DOI :
10.1109/SOCPAR.2013.7054103
Filename :
7054103