• DocumentCode
    2335965
  • Title
    A simple KNN algorithm for text categorization
  • Author
    Soucy, Pascal ; Mineau, Guy W.
  • Author_Institution
    Dept. of Comput. Sci., Laval Univ., Que., Canada
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    647
  • Lastpage
    648
  • Abstract
    Text categorization (also called text classification) is the process of identifying the class to which a text document belongs. This paper proposes a simple KNN algorithm with non-weighted features for text categorization. We propose a feature selection method that finds the features relevant to the learning task at hand using feature interaction (based on word interdependencies). This considerably reduces the number of selected features from which to learn, making our KNN algorithm applicable in contexts where both the volume of documents and the size of the vocabulary are high, as on the World Wide Web. As a result, the proposed KNN algorithm becomes efficient for classifying text documents in that context (in terms of its predictability and interpretability), as is demonstrated. Its simplicity (with respect to implementation and fine-tuning) becomes its main asset for in-the-field applications.
  • Keywords
    classification; feature extraction; text analysis; World Wide Web; feature interaction; feature selection method; learning task; nonweighted features KNN algorithm; text categorization; text classification; text document; word interdependencies; Computer science; Frequency conversion; Solids; Testing; Text categorization; Unsolicited electronic mail; Vocabulary; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    0-7695-1119-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2001.989592
  • Filename
    989592
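
The abstract describes a KNN classifier over a small set of selected, non-weighted features. The sketch below is only an illustration of that general idea, not the authors' implementation: it assumes binary word-presence features, assumes similarity is the count of shared selected words, and takes the selected vocabulary as a hand-picked input instead of deriving it from word interdependencies as the paper proposes. The names `tokenize`, `knn_classify`, `TRAINING`, and `VOCABULARY` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; returns the set of words present."""
    return set(text.lower().split())


def knn_classify(query, training, vocabulary, k=3):
    """Classify `query` by majority vote among its k nearest training documents.

    training   -- list of (text, label) pairs
    vocabulary -- set of selected features (assumed given; the paper selects
                  them via feature interaction, which is not reproduced here)
    Similarity is the number of selected words shared by both documents
    (an assumption: every feature counts equally, with no weighting).
    """
    query_features = tokenize(query) & vocabulary
    scored = []
    for text, label in training:
        doc_features = tokenize(text) & vocabulary
        scored.append((len(query_features & doc_features), label))
    # Highest overlap first; the sort is stable, so ties keep training order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches online", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
    ("win money now", "spam"),
]
VOCABULARY = {"cheap", "buy", "meeting", "project", "money", "agenda"}

if __name__ == "__main__":
    print(knn_classify("buy cheap pills today", TRAINING, VOCABULARY))
    print(knn_classify("agenda for the project meeting", TRAINING, VOCABULARY))
```

Restricting both documents to the selected vocabulary before comparing them is what keeps the per-document cost tied to the number of selected features instead of the full vocabulary size, which is the efficiency argument the abstract makes.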