Title :
Research of Chinese Text Classification Methods Based on Semantic Vector and Semantic Similarity
Author :
Song, Xin ; Huang, Jia ; Zhou, Jing-min ; Chen, Xi
Author_Institution :
State Key Lab. of Software Dev. Environ., Beihang Univ., Beijing, China
Abstract :
To overcome the limitations of traditional text classification approaches based on the bag-of-words representation, and to incorporate linguistic knowledge and conceptual indexing into the text vector space model, we describe each document with a semantic vector instead of a traditional keyword vector. The semantic vector is built with two thesauri, HowNet and Tongyici Cilin (hereinafter referred to as Cilin): words with high similarity are merged, and a single concept, rather than a series of words, describes each semantic feature. This not only reduces the feature dimension but also adds semantic information to the vector. We also classify texts using sentence (document) similarity based on simple vector distance, and carry out three groups of experiments. The results show that accuracy rates generally improve with semantic treatment, which indicates that semantic mining is important and necessary for text classification.
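The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CONCEPT_OF` table is a hypothetical toy stand-in for concept groups that would be derived from HowNet or Cilin, the tokens are assumed to be already segmented (English words are used for readability), cosine similarity is assumed as the "simple vector distance", and nearest-neighbor assignment is assumed as the classification rule.

```python
from collections import Counter
import math

# Hypothetical toy thesaurus: each word maps to a concept label.
# A real system would derive these groups from HowNet or Cilin similarity.
CONCEPT_OF = {
    "football": "SPORT", "soccer": "SPORT", "match": "SPORT",
    "stock": "FINANCE", "share": "FINANCE", "market": "FINANCE",
}

def semantic_vector(tokens):
    """Count concepts instead of words; unknown words keep their own dimension."""
    return Counter(CONCEPT_OF.get(t, t) for t in tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(tokens, labeled_docs):
    """Return the label of the most similar labeled document."""
    query = semantic_vector(tokens)
    best = max(labeled_docs, key=lambda d: cosine(query, semantic_vector(d[0])))
    return best[1]

# Hypothetical labeled examples, for illustration only.
TRAIN = [
    (["football", "match"], "sports"),
    (["stock", "market"], "finance"),
]
```

In this sketch, `classify(["soccer"], TRAIN)` matches the sports document even though the word "soccer" never appears in it, because both map to the same concept dimension. A plain keyword vector would find no overlap at all.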
Keywords :
natural language processing; pattern classification; text analysis; thesauri; Chinese text classification methods; Tongyici Cilin; bag-of-words representation; linguistic knowledge; semantic mining; semantic similarity; semantic vector; text vector space model; thesaurus HowNet; Application software; Computer applications; Data mining; Merging; Multidimensional systems; Natural languages; Programming; Statistical analysis; Text categorization; Thesauri; HowNet; Semantic Similarity; Semantic Vector; Text Classification; Tongyici Cilin
Conference_Title :
Computer Science-Technology and Applications, 2009. IFCSTA '09. International Forum on
Conference_Location :
Chongqing
Print_ISBN :
978-0-7695-3930-0
Electronic_ISBN :
978-1-4244-5423-5
DOI :
10.1109/IFCSTA.2009.167