DocumentCode
2216512
Title
Feature Selection with Maximum Information Metric in Text Categorization
Author
Wang, Haijuan ; Han, Lixin ; Zeng, Xiaoqin ; Zhen, Zhilong
Author_Institution
Dept. of Math., Tonghua Normal Univ., Tonghua, China
fYear
2009
fDate
26-28 Dec. 2009
Firstpage
857
Lastpage
860
Abstract
Text categorization usually suffers from a very large number of features. Most of these are irrelevant or noisy and can mislead the classifier. Feature selection is therefore often performed to improve the efficiency and effectiveness of text categorization. In this paper, a novel feature selection approach for text categorization, called Maximum Information Metric (MIM), is proposed to select high-quality terms from documents. The method uses term weight and document frequency to construct the correlation between a term and each class. Based on information theory, it aims to maximize the differences of a term across classes. We design an evaluation function that yields a ranking of the features. Experimental results on the standard Reuters-21578 and 20-Newsgroups corpora show that the new feature selection approach outperforms classic methods, including Information Gain (IG) and the Chi-square statistic (CHI), in text categorization.
Keywords
document handling; feature extraction; information retrieval; information theory; text analysis; 20-Newsgroups corpus; Chi-square statistic; Information Gain; classifier; document frequency; evaluation function; feature selection; huge-scale number; information theory; maximum information metric; standard Reuters-21578; term weight; text categorization; Computer science; Educational institutions; Information filtering; Information filters; Information retrieval; Information science; Information theory; Mathematics; Statistics; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Science and Engineering (ICISE), 2009 1st International Conference on
Conference_Location
Nanjing
Print_ISBN
978-1-4244-4909-5
Type
conf
DOI
10.1109/ICISE.2009.591
Filename
5454885