DocumentCode :
183045
Title :
New feature selection methods based on context similarity for text categorization
Author :
Yifei Chen ; Bingqing Han ; Ping Hou
Author_Institution :
Sch. of Inf. Sci., Nanjing Audit Univ., Nanjing, China
fYear :
2014
fDate :
19-21 Aug. 2014
Firstpage :
598
Lastpage :
604
Abstract :
High dimensionality of the feature space is one of the most important concerns in text categorization, and feature selection is widely used to reduce the dimensionality of features, speeding up computation without damaging performance. However, many traditional feature selection methods treat each feature separately and are therefore context independent. To address this problem, this paper first studies four well-known frequency-based feature selection methods: Gini Index (GI), Document Frequency (DF), Class Discriminating Measure (CDM) and Accuracy Balanced (Acc2). We then calculate the importance of a feature by measuring the similarity of its contexts among the documents, rather than the frequency of documents containing the feature, so as to incorporate context information. On this basis we propose four new context-similarity-based feature selection methods: GIcs, DFcs, CDMcs and Acc2cs. They are evaluated on different data sets and compared against the four corresponding frequency-based methods. The experimental results show that the context-similarity-based methods outperform the corresponding frequency-based methods in terms of the micro and macro F1 measures on both binary and multi-class classification problems. Benefiting from the multi-word information surrounding features, the context-similarity-based feature selection methods are effective for article categorization.
Keywords :
feature selection; pattern classification; text analysis; Acc2; CDM; DF; GI; Gini index; accuracy balanced; article categorization; binary problem; class discriminating measure; context similarity; dimensionality reduction; document frequency; feature space; frequency based feature selection methods; macroF1 measures; microF1 measures; multiclassification problem; similarity measurement; text categorization; Context; Feature extraction; Frequency measurement; Proteins; Size measurement; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on
Conference_Location :
Xiamen
Print_ISBN :
978-1-4799-5147-5
Type :
conf
DOI :
10.1109/FSKD.2014.6980902
Filename :
6980902