• DocumentCode
    3358854
  • Title
    Automatic Chinese text categorization system based on mutual information
  • Author
    Lu, Zhimao ; Shi, Hong ; Zhang, Qi ; Yuan, Chaoyue
  • Author_Institution
    Inf. & Commun. Eng. Coll., Harbin Eng. Univ., Harbin, China
  • fYear
    2009
  • fDate
    9-12 Aug. 2009
  • Firstpage
    4986
  • Lastpage
    4990
  • Abstract
    Feature selection is a key step in an automatic text categorization system and has a significant impact on the classification result. In this paper we study mutual information (MI), a basic feature selection method. First, by analyzing the MI formula theoretically and systematically, we identify three main problems of MI: the MI loss, the information difference among categories, and the excessive emphasis on low-frequency terms. Then, to solve these three problems, we propose an improved feature selection method that calculates the absolute values of MI and the difference between the maximum and the average of MI. Finally, we test our method using a K-Nearest Neighbor (KNN) classifier and a Support Vector Machine (SVM) classifier, and compare it with the original method on a Chinese corpus. The results demonstrate the effectiveness and feasibility of the proposed method.
  • Keywords
    natural language processing; pattern classification; support vector machines; text analysis; Chinese corpus method; K-nearest neighbor classifier; automatic Chinese text categorization system; feature selection method; mutual information; support vector machine classifier; Automation; Chaotic communication; Educational institutions; Information analysis; Mechatronics; Mutual information; Space technology; Support vector machine classification; Support vector machines; Text categorization; Automatic Text Categorization; Feature Selection; KNN; Mutual Information; SVM;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mechatronics and Automation, 2009. ICMA 2009. International Conference on
  • Conference_Location
    Changchun
  • Print_ISBN
    978-1-4244-2692-8
  • Electronic_ISBN
    978-1-4244-2693-5
  • Type
    conf
  • DOI
    10.1109/ICMA.2009.5245990
  • Filename
    5245990
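
The abstract describes the improved feature score only in words: absolute values of MI, and the difference between the maximum and the average of MI across categories. The following is a minimal sketch of one possible reading of that description, not the authors' implementation. It assumes the pointwise form MI(t, c) = log(P(t | c) / P(t)) estimated from document counts, add-one smoothing, and documents already segmented into tokens; the function names `mi_scores`, `improved_score`, and `select_features` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mi_scores(docs, labels):
    """Return {term: {category: MI(t, c)}} from document-level counts.

    Assumed form: MI(t, c) = log(P(t | c) / P(t)), with add-one smoothing.
    Both the form and the smoothing are assumptions of this sketch.
    """
    n_docs = len(docs)
    cat_docs = Counter(labels)            # number of documents per category
    term_docs = Counter()                 # number of documents containing a term
    term_cat_docs = defaultdict(Counter)  # term -> category -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_docs[term] += 1
            term_cat_docs[term][label] += 1

    scores = {}
    for term, df in term_docs.items():
        p_t = (df + 1) / (n_docs + 2)
        scores[term] = {
            c: math.log(((term_cat_docs[term][c] + 1) / (n_c + 2)) / p_t)
            for c, n_c in cat_docs.items()
        }
    return scores


def improved_score(per_category_mi):
    """Maximum minus average of |MI(t, c)| over all categories."""
    abs_vals = [abs(v) for v in per_category_mi.values()]
    return max(abs_vals) - sum(abs_vals) / len(abs_vals)


def select_features(docs, labels, k):
    """Return the k terms with the highest improved score (ties by term)."""
    scores = mi_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-improved_score(scores[t]), t))
    return ranked[:k]


# Toy corpus of pre-segmented documents (illustrative data only).
docs = [
    ["stock", "market", "the"],
    ["stock", "rise", "the"],
    ["goal", "match", "the"],
    ["goal", "score", "the"],
]
labels = ["finance", "finance", "sports", "sports"]
top_terms = select_features(docs, labels, 2)
```

Under this reading, taking absolute values keeps strongly negative associations from being discarded, and subtracting the average means a term whose MI is the same in every category (such as `the` in the toy corpus) scores zero.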