DocumentCode
3358854
Title
Automatic Chinese text categorization system based on mutual information
Author
Lu, Zhimao ; Shi, Hong ; Zhang, Qi ; Yuan, Chaoyue
Author_Institution
Inf. & Commun. Eng. Coll., Harbin Eng. Univ., Harbin, China
fYear
2009
fDate
9-12 Aug. 2009
Firstpage
4986
Lastpage
4990
Abstract
Feature selection is a key step in an automatic text categorization system and has a significant impact on the classification result. In this paper we study mutual information (MI), a basic feature selection method. First, by analyzing the MI formula theoretically and systematically, we identify three main problems with MI: the MI loss, the information difference among categories, and the excessive emphasis on low-frequency terms. Then, to address these three problems, we propose an improved feature selection method that uses the absolute values of MI and the difference between the maximum and the average of MI. Finally, we test our method with a K-Nearest Neighbor (KNN) classifier and a Support Vector Machine (SVM) classifier, and compare it with the original method on a Chinese corpus. The results demonstrate the effectiveness and feasibility of the proposed method.
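The abstract does not give the exact formula, so the following Python sketch is only one plausible reading of "absolute values of MI" combined with "the difference between the maximum and the average of MI". It assumes document-level pointwise MI, MI(t, c) = log(P(t, c) / (P(t) P(c))), and assigns 0 when a term never co-occurs with a category. The function names (mi_table, max_minus_mean_abs_mi) and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def mi_table(docs):
    """docs: list of (category, set_of_terms). Returns {term: {category: MI}}."""
    n = len(docs)
    cat_count = Counter(c for c, _ in docs)                      # documents per category
    term_count = Counter(t for _, terms in docs for t in set(terms))  # documents per term
    joint = Counter((t, c) for c, terms in docs for t in set(terms))  # documents per (term, category)
    table = {}
    for t in term_count:
        row = {}
        for c in cat_count:
            n_tc = joint[(t, c)]
            if n_tc == 0:
                # Assumption: use 0 instead of -infinity for pairs that never co-occur.
                row[c] = 0.0
            else:
                # log( P(t,c) / (P(t) P(c)) ) estimated from document counts
                row[c] = math.log(n_tc * n / (term_count[t] * cat_count[c]))
        table[t] = row
    return table

def max_minus_mean_abs_mi(row):
    """Assumed score: max over categories of |MI| minus the mean of |MI|."""
    vals = [abs(v) for v in row.values()]
    return max(vals) - sum(vals) / len(vals)

# Toy corpus (hypothetical): two categories, two documents each.
docs = [
    ("sports", {"ball", "team"}),
    ("sports", {"ball", "game"}),
    ("finance", {"bank", "game"}),
    ("finance", {"bank", "stock"}),
]
scores = {t: max_minus_mean_abs_mi(row) for t, row in mi_table(docs).items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, a term spread evenly across categories (here "game") scores 0, while a term concentrated in one category (here "ball") scores higher, which is the kind of category discrimination the abstract aims for.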
Keywords
natural language processing; pattern classification; support vector machines; text analysis; Chinese corpus method; K-nearest neighbor classifier; automatic Chinese text categorization system; feature selection method; mutual information; support vector machine classifier; Automation; Chaotic communication; Educational institutions; Information analysis; Mechatronics; Mutual information; Space technology; Support vector machine classification; Support vector machines; Text categorization; Automatic Text Categorization; Feature Selection; KNN; Mutual Information; SVM;
fLanguage
English
Publisher
IEEE
Conference_Titel
Mechatronics and Automation, 2009. ICMA 2009. International Conference on
Conference_Location
Changchun
Print_ISBN
978-1-4244-2692-8
Electronic_ISBN
978-1-4244-2693-5
Type
conf
DOI
10.1109/ICMA.2009.5245990
Filename
5245990