• DocumentCode
    3756484
  • Title
    Text Categorization Based on Dissimilarity Representation and Prototype Selection
  • Author
    Roberto H.W. Pinheiro;George D.C. Cavalcanti;Tsang Ing Ren
  • Author_Institution
    Centro de Inf., Univ. Fed. de Pernambuco, Cidade Universitaria, Brazil
  • fYear
    2015
  • Firstpage
    163
  • Lastpage
    168
  • Abstract
    Bag-of-Words is the most widely used representation in text categorization; however, it produces sparse, high-dimensional feature vectors with a high feature-to-instance ratio. Feature selection is the most common approach to alleviating these problems, but it does not solve all of them, and information is lost in the process. In this paper, we propose a method based on dissimilarity representation and prototype selection to address these problems. Dissimilarity representation mitigates the problems of Bag-of-Words, and prototype selection is used to select a smaller representation set, increasing the benefits of the dissimilarity representation. The experimental study evaluated the effectiveness of the proposed method on four text categorization databases (RCV1, Reuters, TDT2, and WebKB) using Support Vector Machines. The proposed method reduces the number of features by 79% on average and achieves better or similar results in 84% of the cases when compared with the Bag-of-Words approach.
  • Keywords
    "Prototypes","Training","Text categorization","Databases","Noise measurement","Euclidean distance","Sparse matrices"
  • Publisher
    ieee
  • Conference_Titel
    Intelligent Systems (BRACIS), 2015 Brazilian Conference on
  • Type
    conf
  • DOI
    10.1109/BRACIS.2015.28
  • Filename
    7424013