DocumentCode
3243090
Title
A Text Feature Selection Algorithm Based on Improved TFIDF
Author
Yang, Chengcheng ; He, Xingshi
Author_Institution
Xi'an Polytech. Univ., Xi'an
fYear
2008
fDate
22-24 Oct. 2008
Firstpage
1
Lastpage
4
Abstract
In Chinese text categorization systems, most classifiers use the vector space model (VSM), in which all attributes of the documents form a high-dimensional feature space. This high dimensionality is the bottleneck of categorization. TFIDF is a common method for weighting the terms in a document. The method is simple, but it doesn't consider the unbalanced distribution of terms among classes. This paper analyzes the TFIDF feature selection algorithm in depth and proposes a new TFIDF feature selection method based on Gini index theory. Experimental results show that the method is effective in improving the accuracy of text categorization.
Keywords
natural language processing; text analysis; Chinese text categorization system; Gini index theory; TFIDF feature selection method; text feature selection algorithm; vector space model; Algorithm design and analysis; Electronic mail; Entropy; Frequency; Helium; Mutual information; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Pattern Recognition, 2008. CCPR '08. Chinese Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-2316-3
Type
conf
DOI
10.1109/CCPR.2008.87
Filename
4663040