DocumentCode
183045
Title
New feature selection methods based on context similarity for text categorization
Author
Yifei Chen ; Bingqing Han ; Ping Hou
Author_Institution
Sch. of Inf. Sci., Nanjing Audit Univ., Nanjing, China
fYear
2014
fDate
19-21 Aug. 2014
Firstpage
598
Lastpage
604
Abstract
High dimensionality of the feature space is one of the most important concerns in text categorization, and feature selection is widely used to reduce the dimensionality of the feature space and speed up computation without degrading performance. However, many traditional feature selection methods treat each feature separately and are context independent. To address this problem, this paper first studies four well-known frequency-based feature selection methods: Gini Index (GI), Document Frequency (DF), Class Discriminating Measure (CDM) and Accuracy Balanced (Acc2). We then incorporate context information by calculating the importance of a feature from the similarity of its contexts across documents rather than from the number of documents containing it. On this basis we propose four new context-similarity-based feature selection methods: GIcs, DFcs, CDMcs and Acc2cs. They are evaluated on different data sets and compared against the four corresponding frequency-based methods. The experimental results show that the context-similarity-based methods outperform the corresponding frequency-based methods in terms of the micro and macro F1 measures on both binary and multi-class classification problems. Benefiting from the multi-word information surrounding features, the context-similarity-based feature selection methods are effective for article categorization.
Keywords
feature selection; pattern classification; text analysis; Acc2; CDM; DF; GI; Gini index; accuracy balanced; article categorization; binary problem; class discriminating measure; context similarity; dimensionality reduction; document frequency; feature space; frequency based feature selection methods; macroF1 measures; microF1 measures; multiclassification problem; similarity measurement; text categorization; Context; Feature extraction; Frequency measurement; Proteins; Size measurement; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on
Conference_Location
Xiamen
Print_ISBN
978-1-4799-5147-5
Type
conf
DOI
10.1109/FSKD.2014.6980902
Filename
6980902