• DocumentCode
    423712
  • Title
    Pruning the vocabulary for better context recognition
  • Author
    Madsen, Rasmus Elsborg ; Sigurdsson, Sigurdur ; Hansen, Lars Kai ; Larsen, Jan
  • Author_Institution
    Inf. & Math. Modelling, Tech. Univ. Denmark, Lyngby, Denmark
  • Volume
    2
  • fYear
    2004
  • fDate
    25-29 July 2004
  • Firstpage
    1439
  • Abstract
    Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high dimensional, though, and contains many non-consistent words for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication, our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
  • Keywords
    indexing; neural nets; pattern classification; principal component analysis; probability; semantic Web; text analysis; vocabulary; bag-of-words representation; bag-of-words vocabularies; context recognition; generalization performance; information gain; nonconsistent words; principal component transformations; probabilistic neural network classifier; semantic indexing representation; sensitivity maps; subsequent classifiers; text categorization; text classification; Databases; Humans; Indexing; Internet; Large scale integration; Learning systems; Machine learning; Neural networks; Text categorization; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on
  • ISSN
    1098-7576
  • Print_ISBN
    0-7803-8359-1
  • Type
    conf
  • DOI
    10.1109/IJCNN.2004.1380163
  • Filename
    1380163