DocumentCode :
468354
Title :
Accurate Chinese Text Classification via Multiple Strategies
Author :
Hao, Xiulan ; Zhang, Chenghong ; Tao, Xiaopeng ; Wang, Shuyun ; Hu, Yunfa
Author_Institution :
Fudan Univ., Shanghai
Volume :
3
fYear :
2007
fDate :
24-27 Aug. 2007
Firstpage :
504
Lastpage :
508
Abstract :
Text classification is one means of understanding text content. It is widely used in information retrieval, spam filtering, monitoring of harmful rumors, and blocking pornographic and malicious messages. kNN is widely used in text categorization, but it suffers when the training data set is biased. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training data set is biased, almost all test documents of some rare (smaller) categories are classified into common (larger) ones by the traditional kNN classifier. The performance of text classification cannot satisfy the user's requirements in this case. To alleviate this problem, we adopt two measures to boost the kNN classifier. First, we optimize features by removing some candidate features. Second, we modify the traditional decision rules by integrating the number of training samples of each category into them. Extensive experiments show that the adapted kNN achieves a significant improvement in classification performance on biased corpora.
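The abstract does not give the modified decision rule, so the following is only a minimal sketch of the general idea, assuming the simplest possible adjustment: each category's summed neighbor similarity is divided by that category's number of training samples. The function names, the toy vectors, and the division itself are illustrative assumptions, not the authors' method; the feature-optimization step is not shown.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(doc, train, k):
    """Sum similarity per category over the k most similar training docs."""
    ranked = sorted(
        ((cosine(doc, vec), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, label in ranked[:k]:
        scores[label] += sim
    return scores


def classify_plain(doc, train, k):
    """Traditional rule: pick the category with the largest summed similarity."""
    scores = knn_scores(doc, train, k)
    return max(scores, key=scores.get)


def classify_size_adjusted(doc, train, k):
    """Assumed adjustment: divide each category's score by its training size."""
    sizes = Counter(label for _, label in train)
    scores = knn_scores(doc, train, k)
    return max(scores, key=lambda c: scores[c] / sizes[c])


# Hypothetical biased toy corpus: 6 "common" documents, 2 "rare" documents.
TRAIN = [({"market": 1.0, "offer": 1.0}, "common")] * 6 + [
    ({"offer": 1.0, "free": 1.0}, "rare")
] * 2

# Test document that is most similar to the rare category.
DOC = {"offer": 1.0, "free": 1.0, "market": 0.5}

if __name__ == "__main__":
    print(classify_plain(DOC, TRAIN, k=5))
    print(classify_size_adjusted(DOC, TRAIN, k=5))
```

In this toy corpus the two rare documents are the nearest neighbors, yet with k = 5 the three common neighbors outvote them under the traditional rule; the size-adjusted rule returns the rare category instead. Other ways of folding category size into the vote are equally plausible.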
Keywords :
text analysis; Chinese text classification; Internet information security; Shanghai Council of Information and Security; biased training data set; text categorization; text content; Councils; Information filtering; Information filters; Information retrieval; Information security; Internet; Monitoring; Prototypes; Text categorization; Training data;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on
Conference_Location :
Haikou
Print_ISBN :
978-0-7695-2874-8
Type :
conf
DOI :
10.1109/FSKD.2007.132
Filename :
4406289