• DocumentCode
    383428
  • Title
    Discriminative features for document classification
  • Author
    Torkkola, Kari
  • Author_Institution
    Motorola Labs., Tempe, AZ, USA
  • Volume
    1
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    472
  • Abstract
    Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often used approach is to select a small number of most important features out of the whole set according to some relevant criterion. This paper points out that LSI ignores discrimination while concentrating on representation. Furthermore, selection methods fail to produce a feature set that jointly optimizes class discrimination. As a remedy, we suggest supervised linear discriminative transforms, and report good classification results applying these to the Reuters-21578 database.
  • Keywords
    document image processing; eigenvalues and eigenfunctions; image classification; image representation; Reuters-21578 database; bag-of-words approach; discriminative features; document classification; document representation; document-term matrix; eigendecomposition; latent semantic indexing; statistical classification methods; supervised linear discriminative transforms; Covariance matrix; Databases; Electronic mail; Indexing; Linear discriminant analysis; Optimization methods; Pattern recognition; Rivers; Support vector machine classification; Support vector machines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2002. Proceedings. 16th International Conference on
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-1695-X
  • Type
    conf
  • DOI
    10.1109/ICPR.2002.1044765
  • Filename
    1044765
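
The abstract's central contrast (an unsupervised projection keeps the directions of largest variance, while a supervised linear transform keeps the directions that separate the classes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 2-D toy points stand in for term-count vectors, the leading eigenvector of the pooled scatter matrix stands in for an LSI-style projection, and a two-class Fisher discriminant stands in for the "supervised linear discriminative transforms". All function names are hypothetical.

```python
import math

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, center):
    """Entries (Sxx, Sxy, Syy) of the 2x2 scatter matrix around `center`."""
    a = sum((p[0] - center[0]) ** 2 for p in points)
    b = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    c = sum((p[1] - center[1]) ** 2 for p in points)
    return a, b, c

def leading_eigenvector(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    if b == 0:
        return (1.0, 0.0) if a >= c else (0.0, 1.0)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    v = (lam - c, b)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

def fisher_direction(class0, class1):
    """Two-class Fisher discriminant: w = Sw^{-1} (m1 - m0)."""
    m0, m1 = mean(class0), mean(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1      # within-class scatter Sw
    det = a * c - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of [[a, b], [b, c]] is (1/det) * [[c, -b], [-b, a]].
    return ((c * dx - b * dy) / det, (-b * dx + a * dy) / det)

def project(points, w):
    return [p[0] * w[0] + p[1] * w[1] for p in points]

def ranges_overlap(p, q):
    """True if the intervals spanned by the two projection lists intersect."""
    return max(min(p), min(q)) <= min(max(p), max(q))

# Toy data (assumed): large variance along x, class difference along y.
class0 = [(-4, 0.0), (-2, 0.2), (0, -0.1), (2, 0.1), (4, -0.2)]
class1 = [(-4, 1.0), (-2, 1.2), (0, 0.9), (2, 1.1), (4, 0.8)]

# Unsupervised direction: largest-variance axis of the pooled data.
pooled = class0 + class1
w_pca = leading_eigenvector(*scatter(pooled, mean(pooled)))
pca_proj0, pca_proj1 = project(class0, w_pca), project(class1, w_pca)

# Supervised direction: uses the class labels.
w_lda = fisher_direction(class0, class1)
lda_proj0, lda_proj1 = project(class0, w_lda), project(class1, w_lda)

print("variance-based projection overlaps:", ranges_overlap(pca_proj0, pca_proj1))
print("Fisher projection overlaps:", ranges_overlap(lda_proj0, lda_proj1))
```

On this toy data the variance-based direction lies almost along the x-axis, so the two classes project onto overlapping intervals, whereas the Fisher direction weights the y-axis and the projected classes fall into disjoint intervals. The sketch says nothing about the Reuters-21578 results reported in the paper.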