DocumentCode
2961204
Title
Class document frequency as a learned feature for text categorization
Author
Sharma, Anand ; Kuh, Anthony
Author_Institution
Dept. of Electr. Eng., Univ. of Hawaii, Honolulu, HI
fYear
2008
fDate
1-8 June 2008
Firstpage
2988
Lastpage
2993
Abstract
Document classification represents documents using various word weightings as features. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained with the dfc of words classify test documents about as accurately as those trained with more complicated features. The importance of dfc is further confirmed by simple algorithms built solely on dfc, whose performance compares closely with that of more complex machine learning algorithms. Performance also improves when the dfc of the links of documents in a class is used along with the dfc of the words of the document. We demonstrate the importance of dfc by comparing the algorithms on the Reuters-21578 text categorization test collection and the Cora data set.
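As an illustration of the feature named in the abstract, the sketch below counts, for each class, how many training documents of that class contain each word, then labels a new document with a simple score built from those counts. The abstract does not specify the authors' algorithms, so the scoring rule (sum of dfc over the document's distinct words, divided by class size), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def class_document_frequency(docs, labels):
    """Map each class to a Counter: word -> number of documents
    of that class that contain the word at least once."""
    dfc = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # set() so a word is counted once per document
        dfc[label].update(set(tokens))
    return dfc


def classify(tokens, dfc, class_sizes):
    """Hypothetical scoring rule (an assumption, not the paper's method):
    sum the dfc of the document's distinct words, normalized by class size,
    and return the class with the highest score."""
    words = set(tokens)
    scores = {
        c: sum(dfc[c][w] for w in words) / class_sizes[c]
        for c in dfc
    }
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
train_docs = [
    ["wheat", "crop", "export"],
    ["wheat", "harvest"],
    ["oil", "barrel", "export"],
    ["oil", "price"],
]
train_labels = ["grain", "grain", "crude", "crude"]

dfc = class_document_frequency(train_docs, train_labels)
class_sizes = Counter(train_labels)
predicted = classify(["wheat", "export", "price"], dfc, class_sizes)
```

Normalizing by class size is one possible design choice to keep large classes from dominating the score; other weightings would fit the abstract's description equally well.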
Keywords
classification; learning (artificial intelligence); text analysis; class document frequency; document classification; machine learning; text categorization; Bayesian methods; Classification algorithms; Frequency; Information retrieval; Internet; Machine learning; Machine learning algorithms; Search engines; Testing; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
Conference_Location
Hong Kong
ISSN
1098-7576
Print_ISBN
978-1-4244-1820-6
Electronic_ISBN
1098-7576
Type
conf
DOI
10.1109/IJCNN.2008.4634218
Filename
4634218