Title :
Text Categorization Based on Dissimilarity Representation and Prototype Selection
Author :
Roberto H.W. Pinheiro;George D.C. Cavalcanti;Tsang Ing Ren
Author_Institution :
Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, Brazil
Abstract :
Bag-of-Words is the most widely used representation in text categorization; however, it produces sparse, high-dimensional feature vectors and has a high feature-to-instance ratio. Feature selection is the most common approach to alleviating these problems, but it does not solve all of them, and information is lost in the process. In this paper, we propose a method based on dissimilarity representation and prototype selection to address these problems. Dissimilarity representation mitigates the problems of Bag-of-Words, and prototype selection is used to select a smaller representation set, increasing the benefits of the dissimilarity representation. The experimental study evaluated the proposed method on four text categorization databases (RCV1, Reuters, TDT2, and WebKB) using Support Vector Machines. The proposed method reduces the number of features by 79% on average and achieves better or similar results in 84% of the cases when compared with the Bag-of-Words approach.
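The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which prototype selection method or dissimilarity measure is used, so the sketch assumes random selection as a stand-in and Euclidean distance (suggested only by the keyword list). The function names `select_prototypes` and `dissimilarity_representation` and the toy count matrix are hypothetical.

```python
import numpy as np


def select_prototypes(X, n_prototypes, seed=0):
    """Pick a random subset of rows as the representation set.

    Assumption: random selection is only a placeholder for a real
    prototype selection method, which the abstract does not name.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_prototypes, replace=False)
    return X[idx]


def dissimilarity_representation(X, prototypes):
    """Describe each row of X by its Euclidean distance to every prototype."""
    # Broadcasting yields an array of shape (n_docs, n_prototypes, n_terms).
    diff = X[:, None, :] - prototypes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Toy Bag-of-Words counts: 6 documents, 8 vocabulary terms (made-up data).
bow = np.array([
    [2, 0, 0, 1, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 2, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 2, 0, 0],
    [0, 0, 3, 0, 1, 0, 0, 0],
], dtype=float)

prototypes = select_prototypes(bow, n_prototypes=3)
D = dissimilarity_representation(bow, prototypes)
```

Here each document goes from 8 sparse term counts to 3 dense distances, one per prototype. The matrix `D` would then be passed to a classifier such as a Support Vector Machine in place of `bow`.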
Keywords :
"Prototypes","Training","Text categorization","Databases","Noise measurement","Euclidean distance","Sparse matrices"
Conference_Title :
Intelligent Systems (BRACIS), 2015 Brazilian Conference on
DOI :
10.1109/BRACIS.2015.28