• DocumentCode
    2480533
  • Title
    A Study for Important Criteria of Feature Selection in Text Categorization
  • Author
    Xu Yan
  • Author_Institution
    Beijing Language & Culture Univ., Beijing, China
  • fYear
    2010
  • fDate
    22-23 May 2010
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    A major difficulty of text categorization is the high dimensionality of the feature space. Feature selection is an important step in text categorization because it reduces the feature space. Empirical studies of text categorization show that good performance is related to certain feature selection criteria, and that a method which fails to satisfy a criterion is often non-optimal. According to our analysis, several factors account for good feature selection performance in text categorization tasks, including favoring common terms, using category information, and using term frequency information. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, but none of them satisfies all of the criteria above. In this paper, we present some important criteria for feature selection (FS) in text categorization (TC). Experimental results indicate that the empirical performance of an FS function is tightly related to how well it satisfies these criteria.
  • Keywords
    text analysis; category information; document frequency thresholding; feature selection criteria; feature space; frequency information; information gain; mutual information; text categorization; Availability; Document handling; Frequency measurement; Gain measurement; Information analysis; Mutual information; Organizing; Performance analysis; Space technology; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems and Applications (ISA), 2010 2nd International Workshop on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-5872-1
  • Electronic_ISBN
    978-1-4244-5874-5
  • Type
    conf
  • DOI
    10.1109/IWISA.2010.5473381
  • Filename
    5473381