DocumentCode :
2961204
Title :
Class document frequency as a learned feature for text categorization
Author :
Sharma, Anand ; Kuh, Anthony
Author_Institution :
Dept. of Electr. Eng., Univ. of Hawaii, Honolulu, HI
fYear :
2008
fDate :
1-8 June 2008
Firstpage :
2988
Lastpage :
2993
Abstract :
Document classification uses various word weightings as features for representing documents. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained on the dfc of words classify test documents about as accurately as those trained on more complicated features. The importance of dfc is further supported by simple algorithms developed solely on the basis of dfc, which perform comparably to more complex machine learning algorithms. We also found improved performance when the dfc of the links of documents in a class is used along with the dfc of the words of the document. To show the importance of dfc, we compared the algorithms on the Reuters-21578 text categorization test collection and the Cora data set.
Keywords :
classification; learning (artificial intelligence); text analysis; class document frequency; document classification; machine learning; text categorization; Bayesian methods; Classification algorithms; Frequency; Information retrieval; Internet; Machine learning; Machine learning algorithms; Search engines; Testing; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
Conference_Location :
Hong Kong
ISSN :
1098-7576
Print_ISBN :
978-1-4244-1820-6
Electronic_ISBN :
1098-7576
Type :
conf
DOI :
10.1109/IJCNN.2008.4634218
Filename :
4634218