Title :
Information-theoretic feature selection algorithms for text classification
Author :
J. Novovicova;A. Malik
Author_Institution :
Inst. of Inf. Theor. & Autom., Acad. of Sci. of the Czech Republic, Prague, Czech Republic
fDate :
2005
Abstract :
A major characteristic of the text document classification problem is the extremely high dimensionality of text data. In this paper, we present four new algorithms for feature/word selection for the purpose of text classification. We use sequential forward selection methods based on improved mutual information criterion functions. We discuss the performance of the proposed evaluation functions compared to information gain, which evaluates features individually. We present experimental results using a naive Bayes classifier based on the multinomial model, a linear support vector machine, and a k-nearest neighbor classifier on the Reuters data set. Finally, we analyze the experimental results from various perspectives, including precision, recall and the F1-measure. Preliminary experimental results indicate the effectiveness of the proposed feature selection algorithms in text classification.
Keywords :
"Classification algorithms","Text categorization","Support vector machines","Support vector machine classification","Frequency","Vocabulary","Information theory","Automation","Mutual information","Performance gain"
Conference_Title :
Neural Networks, 2005. IJCNN '05. Proceedings. 2005 IEEE International Joint Conference on
Print_ISBN :
0-7803-9048-2
Electronic_ISBN :
2161-4407
DOI :
10.1109/IJCNN.2005.1556452