Regional Information Center for Science and Technology - A Joint Semantic Vector Representation Model for Text Clustering and Classification

Title of article :

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Author/Authors :

Khanijazani, I - Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
Salami, D - Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
Rahbar, A - Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
Momtazi, S - Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran

Pages :

From page :

443

To page :

450

Abstract :

Text clustering and classification are two main tasks of text mining, and feature selection plays a key role in the quality of their results. Although word-based features such as Term Frequency-Inverse Document Frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use semantic models for document vector representation. Latent Dirichlet Allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this work, we first study the conceptual difference between the two models and show that they behave differently and capture semantic features of texts from different perspectives. We then propose a hybrid approach for document vector representation that benefits from the advantages of both models. Experimental results on the 20 Newsgroups dataset show the superiority of the proposed model over each of the baselines on both text clustering and classification tasks. Compared to the best baseline model, we achieve a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification.
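
The abstract does not say how the two representations are combined, so the following is only a minimal sketch of one plausible reading of a "hybrid" document vector: each vector is scaled to unit length and the two are concatenated with a mixing weight. The function names `l2_normalize` and `joint_vector`, the parameter `topic_weight`, and the sample numbers are all hypothetical and are not taken from the article.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (an all-zero vector stays zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def joint_vector(
    topic_vec: Sequence[float],
    embed_vec: Sequence[float],
    topic_weight: float = 0.5,
) -> List[float]:
    """Concatenate a topic vector and an embedding vector into one feature vector.

    ASSUMPTION: weighted concatenation is an illustrative choice only; the
    source does not state the authors' actual combination method.
    """
    if not 0.0 <= topic_weight <= 1.0:
        raise ValueError("topic_weight must be in [0, 1]")
    # Normalize first so neither part dominates purely because of its scale.
    topic_part = [topic_weight * x for x in l2_normalize(topic_vec)]
    embed_part = [(1.0 - topic_weight) * x for x in l2_normalize(embed_vec)]
    return topic_part + embed_part


# Hypothetical inputs: a 3-topic distribution (as an LDA model might produce)
# and a 4-dimensional dense embedding (as a doc2vec model might produce).
topic_vec = [0.7, 0.2, 0.1]
embed_vec = [0.3, -1.2, 0.8, 0.05]
joint = joint_vector(topic_vec, embed_vec)
print(len(joint))
```

The resulting vector could then be passed to any clustering or classification algorithm in place of either representation on its own. Normalizing before concatenation is a design choice of this sketch: topic proportions and dense embeddings live on different scales.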

Keywords :

Neural Document Embedding , Topic Modeling , Semantic Representation , Text Mining

Journal title :

Astroparticle Physics

Serial Year :

2019

Record number :

2453045

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2453045