Title :
Text Classification Improved through Automatically Extracted Sequences
Author :
Shen, Dou ; Sun, Jian-Tao ; Yang, Qiang ; Zhao, Hui ; Chen, Zheng
Author_Institution :
Hong Kong University of Science and Technology
Abstract :
We propose using the n-multigram model to aid the automatic text classification task. This model can automatically discover the latent semantic sequences contained in the document set of each category. Based on the n-multigram model and the n-gram language model, we put forward two text classification algorithms. Experiments on RCV1 show that our proposed algorithm based on the n-multigram model achieves classification performance similar to that of the one based on the n-gram model, while its model size is only 4.21% of the latter's. The other proposed algorithm, based on the combination of the n-multigram model and the n-gram model, improves the micro-F1 and macro-F1 values by 3.5% and 4.5% respectively, which supports the validity of our approach.
Keywords :
Asia; Classification algorithms; Computer science; Knowledge management; Natural language processing; Probability distribution; Sun; Text categorization; Uncertainty; Vocabulary;
Conference_Title :
Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on
Print_ISBN :
0-7695-2570-9
DOI :
10.1109/ICDE.2006.158