• DocumentCode
    2976573
  • Title
    SRFW: a simple, fast and effective text classification algorithm
  • Author
    Deng, Zhi-Hong ; Tang, Shi-Wei ; Yang, Dong-Qing ; Zhang, Ming ; Wu, Xiao-Bin ; Yang, Meng
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Peking Univ., Beijing, China
  • Volume
    3
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    1267
  • Abstract
    Text classification is a powerful technique for automating the assignment of documents to topic hierarchies. Although a number of text classification algorithms exist, most of them are either inefficient or too complex. We present a linear text classification algorithm called SRFW, which is fast, effective and easy to use. SRFW obtains relevance factors. For new unlabelled documents, SRFW uses a sum of weights based on relevance factors to obtain the probability that each document belongs to each category, and assigns the document to the category with the highest probability. We have evaluated our algorithm on a subset of the Reuters-21578 and 20-newsgroups text collections and compared it against k nearest neighbor (k-NN) and support vector machines (SVM). Experimental results show that SRFW is competitive with k-NN and SVM, while being much simpler and faster than both.
  • Keywords
    information retrieval; learning automata; natural languages; pattern classification; text analysis; 20-newsgroups text collection; Reuters-21578 text collection; SRFW; discriminating power; documents assignment; k-nearest neighbor method; linear text classification algorithm; relevance factors; statistical methods; support vector machines; topic hierarchies; unlabelled documents; Classification algorithms; Computer science; Electronic mail; Laboratories; Nearest neighbor searches; Neural networks; Probability; Support vector machine classification; Support vector machines; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2002. Proceedings. 2002 International Conference on
  • Print_ISBN
    0-7803-7508-4
  • Type
    conf
  • DOI
    10.1109/ICMLC.2002.1167407
  • Filename
    1167407