• DocumentCode
    2961204
  • Title
    Class document frequency as a learned feature for text categorization
  • Author
    Sharma, Anand ; Kuh, Anthony
  • Author_Institution
    Dept. of Electr. Eng., Univ. of Hawaii, Honolulu, HI
  • fYear
    2008
  • fDate
    1-8 June 2008
  • Firstpage
    2988
  • Lastpage
    2993
  • Abstract
    Document classification represents documents using various word weightings as features. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained with the dfc of words classify test documents about as accurately as those trained with more complicated features. The importance of dfc is further verified by simple algorithms developed solely on the basis of dfc, whose performance compares closely with that of more complex machine learning algorithms. We also found improved performance when the dfc of the links of documents in a class is used along with the dfc of the words of the document. We demonstrate the importance of dfc by comparing the algorithms on the Reuters-21578 text categorization test collection and the Cora data set.
  • Keywords
    classification; learning (artificial intelligence); text analysis; class document frequency; document classification; machine learning; text categorization; Bayesian methods; Classification algorithms; Frequency; Information retrieval; Internet; Machine learning; Machine learning algorithms; Search engines; Testing; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1098-7576
  • Print_ISBN
    978-1-4244-1820-6
  • Electronic_ISBN
    1098-7576
  • Type
    conf
  • DOI
    10.1109/IJCNN.2008.4634218
  • Filename
    4634218
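
The abstract names class document frequency (dfc) without defining it. A minimal sketch follows, assuming dfc(w, c) means the number of training documents in class c that contain word w at least once. The scoring rule in `classify` (sum of dfc over a document's distinct words, divided by class size) is a hypothetical stand-in for the "simple algorithms" the abstract mentions, not the authors' method. The function names and the toy documents are also illustrative assumptions.

```python
from collections import Counter, defaultdict


def class_document_frequency(docs, labels):
    """Count, per class, how many of its documents contain each word.

    Assumed definition: dfc[c][w] = number of documents labeled c
    in which word w occurs at least once.
    """
    dfc = defaultdict(Counter)
    class_sizes = Counter(labels)
    for text, label in zip(docs, labels):
        # set() ensures a word is counted once per document
        dfc[label].update(set(text.lower().split()))
    return dfc, class_sizes


def classify(text, dfc, class_sizes):
    """Hypothetical rule (not from the paper): pick the class with the
    largest sum of dfc over the document's distinct words, normalized
    by the number of training documents in that class."""
    words = set(text.lower().split())
    scores = {
        c: sum(dfc[c][w] for w in words) / class_sizes[c]
        for c in class_sizes
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration only
docs = [
    "wheat prices rise",
    "wheat harvest grows",
    "oil prices fall",
    "crude oil exports",
]
labels = ["grain", "grain", "crude", "crude"]

dfc, class_sizes = class_document_frequency(docs, labels)
print(dfc["grain"]["wheat"])
print(classify("wheat harvest report", dfc, class_sizes))
print(classify("oil exports", dfc, class_sizes))
```

Counting each word once per document is what distinguishes a document frequency from a raw term frequency; whether the paper additionally normalizes or weights these counts cannot be told from the abstract.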