DocumentCode :
2689126
Title :
Text Classification Using a Graph of Terms
Author :
Napoletano, Paolo ; Colace, Francesco ; De Santo, Massimo ; Greco, Luca
Author_Institution :
Dept. of Electron. Eng. & Comput. Eng., Univ. of Salerno, Fisciano, Italy
fYear :
2012
fDate :
4-6 July 2012
Firstpage :
1030
Lastpage :
1035
Abstract :
It is well known that supervised text classification methods need to learn from many labeled examples to achieve high accuracy. In real settings, however, sufficient labeled examples are not always available. For this reason, there has been recent interest in methods that can achieve high accuracy even when the training set is small. The main purpose of text mining techniques is to identify common patterns by observing vectors of features and then to use those patterns to make predictions. Most existing methods use a vector of features made up of weighted words, which unfortunately are insufficiently discriminative when the number of features is much higher than the number of labeled examples. In this paper we demonstrate that more complex features than simple weighted words can be employed to obtain greater accuracy in the analysis and discovery of common patterns. The proposed vector of features considers a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, that can be automatically constructed from a set of documents through the probabilistic Topic Model. The method has been tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset, learned on several subsets of the original training set, and shows better performance than a method that uses a list of weighted words as a vector of features with linear support vector machines.
Keywords :
classification; data mining; learning (artificial intelligence); support vector machines; text analysis; Graph of Terms; ModApte split; Reuters-21578 dataset; documents; linear support vector machines; probabilistic topic model; supervised text classification methods; text mining techniques; training set; word directed sub-graph; word undirected sub-graph; Accuracy; Computational modeling; Feature extraction; Probabilistic logic; Semantics; Training; Vectors; Text classification; term extraction
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Complex, Intelligent and Software Intensive Systems (CISIS), 2012 Sixth International Conference on
Conference_Location :
Palermo
Print_ISBN :
978-1-4673-1233-2
Type :
conf
DOI :
10.1109/CISIS.2012.183
Filename :
6245692