• DocumentCode
    3243090
  • Title

    A Text Feature Selection Algorithm Based on Improved TFIDF

  • Author

    Yang, Chengcheng ; He, Xingshi

  • Author_Institution
    Xi'an Polytech. Univ., Xi'an
  • fYear
    2008
  • fDate
    22-24 Oct. 2008
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    In Chinese text categorization systems, most classifiers use the vector space model (VSM), in which all attributes of the documents form a high-dimensional feature space. The high dimensionality of this feature space is the bottleneck of categorization. TFIDF is a common method for weighting the terms in a document. The method is simple, but it does not consider the unbalanced distribution of terms among classes. This paper analyzes the TFIDF feature selection algorithm in depth and proposes a new TFIDF feature selection method based on Gini index theory. Experimental results show that the method is effective in improving the accuracy of text categorization.
  • Keywords
    natural language processing; text analysis; Chinese text categorization system; Gini index theory; TFIDF feature selection method; text feature selection algorithm; vector space model; Algorithm design and analysis; Electronic mail; Entropy; Frequency; Helium; Mutual information; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2008. CCPR '08. Chinese Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-2316-3
  • Type

    conf

  • DOI
    10.1109/CCPR.2008.87
  • Filename
    4663040
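
The abstract in this record refers to TFIDF term weighting and to a Gini-index-based correction for terms that are unevenly distributed among classes, but it does not give either formula. The sketch below is only an illustration under stated assumptions, not the authors' method: it uses the textbook TF-IDF weight (relative term frequency times log(N / document frequency)) and, as an assumed stand-in for the Gini index term, the purity score sum over classes of P(class | term)^2 estimated from document counts. The function names (`tfidf`, `gini_purity`, `gini_weighted_tfidf`), the way the two scores are multiplied, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Textbook TF-IDF. docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def gini_purity(docs, labels):
    """Assumed Gini-style score: sum over classes of P(class | term)^2.

    1.0 means the term occurs in documents of a single class only;
    1/k means it is spread evenly over k classes.
    """
    per_term = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            per_term[t][label] += 1
    return {
        t: sum((c / sum(cnt.values())) ** 2 for c in cnt.values())
        for t, cnt in per_term.items()
    }


def gini_weighted_tfidf(docs, labels):
    """Hypothetical combination: scale each TF-IDF weight by the term's purity score."""
    purity = gini_purity(docs, labels)
    return [{t: w * purity[t] for t, w in doc.items()} for doc in tfidf(docs)]


# Toy data, invented for illustration only.
docs = [
    ["stock", "market", "rise"],
    ["stock", "price", "fall"],
    ["match", "team", "win"],
    ["team", "stock", "score"],
]
labels = ["finance", "finance", "sports", "sports"]

purity = gini_purity(docs, labels)
weighted = gini_weighted_tfidf(docs, labels)
```

In this toy data, "stock" appears in both classes and so receives a lower purity score than "market", which appears in one class only; multiplying by the purity score therefore reduces the weight of "stock" relative to plain TF-IDF. Whether the paper combines the two quantities this way cannot be determined from the abstract.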