DocumentCode :
3358854
Title :
Automatic Chinese text categorization system based on mutual information
Author :
Lu, Zhimao ; Shi, Hong ; Zhang, Qi ; Yuan, Chaoyue
Author_Institution :
Inf. & Commun. Eng. Coll., Harbin Eng. Univ., Harbin, China
fYear :
2009
fDate :
9-12 Aug. 2009
Firstpage :
4986
Lastpage :
4990
Abstract :
Feature selection is a key step in an automatic text categorization system and has a significant impact on classification results. In this paper we study mutual information (MI), one of the basic feature selection methods. First, by analyzing the MI formula theoretically and systematically, we identify three main problems: MI loss, the information difference among categories, and the excessive emphasis on low-frequency terms. Then, to solve these three problems, we propose an improved feature selection method that calculates the absolute values of MI and the difference between the maximum and the average of MI. Finally, we test our method using K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers, and compare it with the original method on a Chinese corpus. The results demonstrate the effectiveness and feasibility of the proposed method.
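The abstract does not give the exact formula, so the following is only a minimal sketch of one possible reading of "absolute values of MI" combined with "the difference between maximum and average of MI". It assumes the standard term/category MI estimate log(P(t,c) / (P(t) P(c))) computed from document counts. The function names, the count-dictionary layout, and the choice to score an unseen term/category pair as 0 are all hypothetical and are not taken from the paper.

```python
import math


def pointwise_mi(n_tc, n_t, n_c, n):
    """Standard MI estimate log(P(t,c) / (P(t) * P(c))) from document counts.

    n_tc: documents of category c containing term t
    n_t:  documents containing term t (all categories)
    n_c:  documents in category c
    n:    total documents
    """
    if n_tc == 0:
        # Assumption for this sketch: an unseen pair contributes 0
        # instead of negative infinity.
        return 0.0
    return math.log((n_tc * n) / (n_t * n_c))


def max_minus_mean_abs_mi(term_counts, class_sizes):
    """Score a term by max_c |MI(t,c)| minus the mean over c of |MI(t,c)|.

    term_counts: {category: documents in that category containing the term}
    class_sizes: {category: total documents in that category}
    """
    n = sum(class_sizes.values())
    n_t = sum(term_counts.values())
    abs_mi = [
        abs(pointwise_mi(term_counts.get(c, 0), n_t, size, n))
        for c, size in class_sizes.items()
    ]
    return max(abs_mi) - sum(abs_mi) / len(abs_mi)


# Hypothetical two-category corpus with 50 documents per category.
sizes = {"sports": 50, "finance": 50}
even_term = max_minus_mean_abs_mi({"sports": 10, "finance": 10}, sizes)
skewed_term = max_minus_mean_abs_mi({"sports": 20, "finance": 0}, sizes)
```

Under these assumptions, a term spread evenly over the categories scores 0, while a term concentrated in one category scores higher, which is the kind of between-category difference the abstract refers to.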
Keywords :
natural language processing; pattern classification; support vector machines; text analysis; Chinese corpus method; K-nearest neighbor classifier; automatic Chinese text categorization system; feature selection method; mutual information; support vector machine classifier; Automation; Chaotic communication; Educational institutions; Information analysis; Mechatronics; Mutual information; Space technology; Support vector machine classification; Support vector machines; Text categorization; Automatic Text Categorization; Feature Selection; KNN; Mutual Information; SVM;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Mechatronics and Automation, 2009. ICMA 2009. International Conference on
Conference_Location :
Changchun
Print_ISBN :
978-1-4244-2692-8
Electronic_ISBN :
978-1-4244-2693-5
Type :
conf
DOI :
10.1109/ICMA.2009.5245990
Filename :
5245990