Title :
Document representation combining concepts and words in Chinese text categorization
Author :
Che, Chao ; Teng, Hongfei
Author_Institution :
Dalian Univ. of Technol., Dalian, China
Abstract :
Word-based representation is widely used in text categorization, but its performance suffers from problems that arise from language variation. In this paper, we investigate a document representation that combines words and concepts to integrate the advantages of both types of representation. To reduce disambiguation mistakes, the approach uses the part of speech as the concept for any word that is error-prone in word sense disambiguation. It employs three ways to measure the contribution of each representation form to classification and selects the most productive form as the feature, which drops concepts unsuitable for representation without losing lexical semantic information. We conduct experiments comparing the performance of different types of representation on the Fudan University Chinese text categorization corpus. The results confirm the validity of our combination representation.
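The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three contribution measures, so this example assumes a chi-square statistic as a stand-in for one of them. The function names (`concept_for`, `choose_representation`), the "more than one candidate sense" test for error-prone words, and the toy documents are all hypothetical and are not taken from the paper.

```python
def concept_for(senses, pos):
    """Return the concept for a word, falling back to its part of speech.

    Assumption: a word counts as error-prone for disambiguation when it
    has more than one candidate sense in the thesaurus.
    """
    return senses[0] if len(senses) == 1 else "POS:" + pos


def chi_square(present, labels, target):
    """Chi-square statistic between a binary feature and one class."""
    n = len(labels)
    a = sum(1 for has, y in zip(present, labels) if has and y == target)
    b = sum(1 for has, y in zip(present, labels) if has and y != target)
    c = sum(1 for has, y in zip(present, labels) if not has and y == target)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score(feature, doc_features, labels):
    """Best chi-square value of a feature over all classes."""
    present = [feature in feats for feats in doc_features]
    return max(chi_square(present, labels, t) for t in set(labels))


def choose_representation(word, concept, docs, labels):
    """Keep whichever form (word or concept) scores higher; ties keep the word.

    docs is a list of documents, each a list of (word, concept) tokens.
    """
    word_sets = [{w for w, _ in doc} for doc in docs]
    concept_sets = [{c for _, c in doc} for doc in docs]
    word_score = score(word, word_sets, labels)
    concept_score = score(concept, concept_sets, labels)
    if concept_score > word_score:
        return ("concept", concept)
    return ("word", word)


# Toy labelled corpus (hypothetical data, for illustration only).
docs = [
    [("football", "SPORT"), ("win", "EVENT")],
    [("basketball", "SPORT"), ("win", "EVENT")],
    [("stock", "FINANCE"), ("win", "EVENT")],
    [("bond", "FINANCE")],
]
labels = ["sports", "sports", "finance", "finance"]

print(choose_representation("football", "SPORT", docs, labels))
print(choose_representation("win", "EVENT", docs, labels))
```

In this toy corpus the concept SPORT covers both sports documents while the word "football" covers only one, so the concept is kept for that token. For "win" the concept adds nothing over the word, so the word is kept.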
Keywords :
natural language processing; text analysis; Chinese text categorization; document representation; language variation; lexical semantic information; word sense disambiguation; word-based representation; Channel hot electron injection; Chaos; Dictionaries; Frequency; Robustness; Speech; Text categorization; Thesauri; combination representation; concept-based representation
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
DOI :
10.1109/NLPKE.2009.5313771