Title :
Automatic text categorization by a Granular Computing approach: Facing unbalanced data sets
Author :
Possemato, Francesca ; Rizzi, Antonello
Author_Institution :
Dept. of Inf. Eng., Electron., & Telecommun., SAPIENZA Univ. of Rome, Rome, Italy
Abstract :
Text categorization is an interesting application of machine learning covering a wide range of possible applications, from document management systems to web mining. In designing such a system it is mandatory to correctly define both a suitable preprocessing procedure and an effective document representation as closely related as possible to the semantic nature of document categories. To this aim, relying on a Granular Computing approach and considering a document as an ordered sequence of words, we propose a system able to automatically mine frequent terms, considering as a term not only a single word, but also a subsequence of (a few) consecutive words. The whole classification system is tailored to process sequences of atomic elements (i.e., encoded words) by means of an embedding procedure based on clustering methods. However, when dealing with unbalanced data sets, i.e., when classes are not evenly represented in the data set, the frequent substructures search procedure must be carefully designed. We demonstrate the effectiveness of the system on a well-known benchmarking data set, achieving competitive test set classification accuracy results, with a remarkably low structural complexity of the synthesized classification models.
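The idea of treating short runs of consecutive words as candidate terms, and of searching for frequent ones with class imbalance in mind, can be illustrated with a minimal sketch. This is not the authors' procedure: the function names (`ngrams`, `frequent_terms_per_class`), the parameters `max_len` and `min_support`, and the choice to measure support within each class separately are all assumptions made for illustration. The sketch simply shows why a single threshold over the whole data set can hide terms that are typical of a rare class.

```python
from collections import Counter


def ngrams(words, max_len):
    """Yield every run of 1..max_len consecutive words as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def frequent_terms_per_class(docs, labels, max_len=3, min_support=0.5):
    """Return {label: set of n-grams} that are frequent within each class.

    Assumption (not from the paper): support is the fraction of documents
    of that class containing the n-gram, so a small class is judged
    against its own size rather than the size of the whole data set.
    """
    # Group the tokenized documents by class label.
    by_class = {}
    for words, label in zip(docs, labels):
        by_class.setdefault(label, []).append(words)

    frequent = {}
    for label, class_docs in by_class.items():
        counts = Counter()
        for words in class_docs:
            # Count each n-gram at most once per document.
            counts.update(set(ngrams(words, max_len)))
        frequent[label] = {
            term
            for term, count in counts.items()
            if count / len(class_docs) >= min_support
        }
    return frequent
```

With four documents in one class and a single document in another, a global threshold of half the data set would discard every term of the minority class, whereas the per-class support used here retains them.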
Keywords :
data mining; granular computing; learning (artificial intelligence); pattern classification; pattern clustering; text analysis; Web mining; atomic elements; automatic text categorization; classification system; clustering methods; document categories; document management systems; document representation; embedding procedure; encoded words; frequent substructures search procedure; frequent terms mining; granular computing; machine learning; preprocessing procedure; system design; unbalanced data sets; Computational modeling; Feature extraction; Histograms; Natural languages; Text categorization; Vectors; Frequent substructures mining; Granular computing; Text categorization; Unbalanced data sets;
Conference_Title :
Neural Networks (IJCNN), The 2013 International Joint Conference on
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4673-6128-6
DOI :
10.1109/IJCNN.2013.6707082