• DocumentCode
    1948747
  • Title
    Automatic Identification of Stop Words in Chinese Text Classification
  • Author
    Hao, Lili ; Hao, Lizhu
  • Author_Institution
    Inst. of Math., Jilin Univ., Changchun
  • Volume
    1
  • fYear
    2008
  • fDate
    12-14 Dec. 2008
  • Firstpage
    718
  • Lastpage
    722
  • Abstract
    Text classification is an active research area in information retrieval and natural language processing. A fundamental tool in text classification is a list of 'stop' words (a stop word list), which identifies frequent words that are unlikely to assist in classification and are therefore deleted during pre-processing. To date, many stop word lists have been developed for English. However, no standard stop word list has yet been constructed for Chinese text classification. In this paper, we give a refined definition of stop words in Chinese text classification from the perspective of statistical correlation, then propose an automatic approach to extracting the stop word list based on the weighted Chi-squared statistic on a 2×p contingency table. We evaluate the stop word lists using the accuracies obtained from text classification experiments on a real-world Chinese corpus. The results show that the proposed approach is effective: the stop word lists it derives both speed up computation and increase classification accuracy.
  • Keywords
    information retrieval; natural language processing; pattern classification; statistics; text analysis; Chinese text classification; English language; automatic stop words identification; weighted Chi-squared statistic; Computer science; Employment; Mathematics; Natural languages; Software engineering; Text categorization; Stop words
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Software Engineering, 2008 International Conference on
  • Conference_Location
    Wuhan, Hubei
  • Print_ISBN
    978-0-7695-3336-0
  • Type
    conf
  • DOI
    10.1109/CSSE.2008.829
  • Filename
    4721850
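
The abstract describes ranking words by a Chi-squared statistic computed on a 2×p contingency table (word present or absent, across p categories). The sketch below illustrates only the plain, unweighted Pearson statistic; the weighting scheme used in the paper is not given in this record and is not reproduced here. The function names `chi_squared_2xp` and `rank_stop_word_candidates` are hypothetical, and documents are assumed to be already segmented into tokens. Words with a statistic near zero are distributed almost independently of the category, which makes them stop word candidates under this reading.

```python
from collections import Counter


def chi_squared_2xp(present, totals):
    """Unweighted Pearson chi-squared statistic for a 2 x p table.

    present[j] = number of documents in category j that contain the word
    totals[j]  = number of documents in category j
    """
    n = sum(totals)
    row_present = sum(present)
    row_absent = n - row_present
    stat = 0.0
    for obs_present, col_total in zip(present, totals):
        obs_absent = col_total - obs_present
        for observed, row_total in ((obs_present, row_present),
                                    (obs_absent, row_absent)):
            expected = row_total * col_total / n
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    return stat


def rank_stop_word_candidates(docs, labels):
    """Return (word, statistic) pairs sorted from least to most
    category-dependent. Assumption: docs are lists of tokens."""
    classes = sorted(set(labels))
    totals = [labels.count(c) for c in classes]
    doc_freq = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))  # count each word once per document
    vocab = set().union(*(doc_freq[c].keys() for c in classes))
    scores = {
        w: chi_squared_2xp([doc_freq[c][w] for c in classes], totals)
        for w in vocab
    }
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


# Toy usage with made-up, pre-tokenized documents.
docs = [["the", "ball"], ["the", "goal"], ["the", "vote"], ["the", "law"]]
labels = ["sport", "sport", "politics", "politics"]
ranked = rank_stop_word_candidates(docs, labels)
```

In this toy data, "the" occurs in every document of both categories, so its statistic is zero and it is ranked first as a stop word candidate. A real pipeline would also need a cutoff on the statistic or on the list length, which this record does not specify.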