• DocumentCode
    2546118
  • Title
    An evolutionary approach for discovering effective composite features for text categorization
  • Author
    Wong, Alex K S ; Lee, John W T
  • Author_Institution
    Hong Kong Polytech. Univ., Kowloon
  • fYear
    2007
  • fDate
    7-10 Oct. 2007
  • Firstpage
    3045
  • Lastpage
    3050
  • Abstract
    The study of text categorization has assumed special significance in the Internet era in helping us navigate the ocean of web pages and emails that continue to grow at an unrelenting pace. Many previous works on text classification have shown that composite features consisting of multiple word tokens, such as statistical phrases, can contribute effectively to the classification task. However, finding useful composite features through a comprehensive search of the vast number of possibilities is often prohibitive in terms of computing resource requirements. In the past, to make the search feasible, the search space was often limited by imposing parametric constraints such as minimum frequency and/or number of words in the composite feature. In this paper we propose a new evolutionary approach to finding effective composite features for classification, one that combines probabilistic feature generation with error-biased sampling. We demonstrate the effectiveness of our approach using the Reuters-21578 test collection.
  • Keywords
    evolutionary computation; feature extraction; sampling methods; text analysis; composite features; error-biased sampling; evolutionary approach; multiple word tokens; parametric constraints; probabilistic feature generation; statistical phrases; text categorization; text classifications; Electronic mail; Explosions; Frequency; Internet; Navigation; Oceans; Sampling methods; Testing; Text categorization; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
  • Conference_Location
    Montreal, Que.
  • Print_ISBN
    978-1-4244-0990-7
  • Electronic_ISBN
    978-1-4244-0991-4
  • Type
    conf
  • DOI
    10.1109/ICSMC.2007.4413981
  • Filename
    4413981