DocumentCode
2335965
Title
A simple KNN algorithm for text categorization
Author
Soucy, Pascal ; Mineau, Guy W.
Author_Institution
Dept. of Comput. Sci., Laval Univ., Que., Canada
fYear
2001
fDate
2001
Firstpage
647
Lastpage
648
Abstract
Text categorization (also called text classification) is the process of identifying the class to which a text document belongs. This paper proposes a simple KNN algorithm with non-weighted features for text categorization. We propose a feature selection method that finds the relevant features for the learning task at hand using feature interaction (based on word interdependencies). This considerably reduces the number of selected features from which to learn, making our KNN algorithm applicable in contexts where both the volume of documents and the size of the vocabulary are high, as with the World Wide Web. The proposed KNN algorithm therefore becomes efficient for classifying text documents in that context (in terms of its predictability and interpretability), as is demonstrated. Its simplicity (with respect to its implementation and fine-tuning) is its main asset for in-the-field applications.
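The general idea of a non-weighted KNN text classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the value of k, or the feature-interaction selection procedure, so the Jaccard overlap, the default `k=3`, the function names, and the hand-picked `vocabulary` below are all assumptions made for illustration. The vocabulary stands in for the output of a feature selection step.

```python
from collections import Counter


def featurize(text, vocabulary):
    """Return the set of selected words present in the text (binary, no weights)."""
    return {word for word in text.lower().split() if word in vocabulary}


def knn_classify(query, training, vocabulary, k=3):
    """Label `query` by majority vote among its k most similar training documents.

    Similarity is the Jaccard overlap of unweighted feature sets, which is an
    assumed choice and not taken from the paper.
    """
    query_features = featurize(query, vocabulary)
    scored = []
    for text, label in training:
        doc_features = featurize(text, vocabulary)
        union = query_features | doc_features
        similarity = len(query_features & doc_features) / len(union) if union else 0.0
        scored.append((similarity, label))
    # Sort by similarity only; ties keep their original training order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
vocabulary = {"cheap", "buy", "pills", "meeting", "project", "agenda"}
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

spam_guess = knn_classify("buy cheap pills today", training, vocabulary)
ham_guess = knn_classify("project meeting agenda", training, vocabulary)
```

Because features are plain set members rather than weighted terms, restricting the vocabulary directly shrinks every comparison, which is why a strong feature selection step matters for this kind of classifier.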
Keywords
classification; feature extraction; text analysis; World Wide Web; feature interaction; feature selection method; learning task; nonweighted features KNN algorithm; text categorization; text classification; text document; word interdependencies; Computer science; Frequency conversion; Solids; Testing; Text categorization; Unsolicited electronic mail; Vocabulary; Web sites;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location
San Jose, CA
Print_ISBN
0-7695-1119-8
Type
conf
DOI
10.1109/ICDM.2001.989592
Filename
989592