DocumentCode :
244726
Title :
Classifying text documents using unconventional representation
Author :
Harish, B.S. ; Aruna Kumar, S.V. ; Manjunath, S.
Author_Institution :
Dept. of Inf. Sci. & Eng., S.J. Coll. of Eng., Mysore, India
fYear :
2014
fDate :
15-17 Jan. 2014
Firstpage :
210
Lastpage :
216
Abstract :
Classification of text documents is one of the most common themes in the field of machine learning. Although a text document expresses a wide range of information, it lacks the imposed structure of a traditional database. Thus, unstructured data, particularly free running text data, has to be transformed into structured data. Hence, in this paper we represent the text document unconventionally by making use of symbolic data analysis concepts. We propose a new method of representing documents based on clustering of term frequency vectors. Term frequency vectors of each cluster are used to form a symbolic representation by the use of mean and standard deviation. Further, term frequency vectors are used in the form of interval valued features. To cluster the term frequency vectors, we make use of Single Linkage, Complete Linkage, Average Linkage, K-Means and Fuzzy C-Means clustering algorithms. To corroborate the efficacy of the proposed model, we conducted extensive experiments on standard datasets such as 20 Newsgroup Large, 20 Mini Newsgroup and Vehicles Wikipedia, and on our own datasets, Google Newsgroup and Research Article Abstracts. Experimental results reveal that the proposed model gives better results when compared to state-of-the-art techniques. In addition, as the method is based on a simple matching scheme, it requires negligible time.
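The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the interval bounds [mean - std, mean + std], the containment-count matching score, the tiny k-means routine with fixed initial centers, and all function names (`kmeans`, `fit_intervals`, `classify`) are assumptions made for illustration only.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny k-means; initial centers are the first k rows (assumed, for determinism)."""
    centers = X[:k].astype(float).copy()
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fit_intervals(X, y, k=2):
    """Cluster each class's term frequency vectors and return one
    (class_label, lower, upper) interval-valued reference per cluster."""
    refs = []
    for c in np.unique(y):
        Xc = X[y == c]
        labels = kmeans(Xc, min(k, len(Xc)))
        for j in np.unique(labels):
            members = Xc[labels == j]
            mu, sigma = members.mean(axis=0), members.std(axis=0)
            # Assumed interval form: [mean - std, mean + std] per term.
            refs.append((c, mu - sigma, mu + sigma))
    return refs

def classify(x, refs):
    """Assumed matching scheme: pick the class of the reference whose
    intervals contain the largest number of the query's term frequencies."""
    scores = [np.sum((x >= lo) & (x <= hi)) for _, lo, hi in refs]
    return refs[int(np.argmax(scores))][0]

# Toy term frequency vectors over four hypothetical terms.
X = np.array([[5, 4, 0, 0], [6, 5, 1, 0], [4, 6, 0, 1],
              [0, 1, 5, 6], [1, 0, 6, 5], [0, 0, 4, 5]], dtype=float)
y = np.array(["sport", "sport", "sport", "tech", "tech", "tech"])

refs = fit_intervals(X, y, k=2)
pred_a = classify(np.array([5.0, 5.0, 0.0, 0.0]), refs)
pred_b = classify(np.array([0.0, 1.0, 5.0, 5.0]), refs)
```

Because classification reduces to interval containment checks against a small set of cluster references, the cost per query is linear in the number of terms and references, which is consistent with the abstract's remark that matching takes negligible time.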
Keywords :
data analysis; data structures; fuzzy set theory; learning (artificial intelligence); pattern classification; pattern clustering; pattern matching; text analysis; K-means clustering algorithms; average linkage; complete linkage; free running text data; fuzzy C-means clustering algorithms; interval valued features; machine learning; matching scheme; mean; single linkage; standard datasets; standard deviation; symbolic data analysis; symbolic representation; term frequency vector clustering; text document classification; unconventional text document representation; unstructured data; Accuracy; Classification algorithms; Clustering algorithms; Couplings; Text categorization; Training; Vectors; Classification; Clustering Algorithms; Representation; Text Documents;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data and Smart Computing (BIGCOMP), 2014 International Conference on
Conference_Location :
Bangkok
Type :
conf
DOI :
10.1109/BIGCOMP.2014.6741438
Filename :
6741438