Author/Authors :
Khanijazani, I Computer Engineering and Information Technology Department - Amirkabir University of Technology - Tehran, Iran , Salami, D Computer Engineering and Information Technology Department - Amirkabir University of Technology - Tehran, Iran , Rahbar, A Computer Engineering and Information Technology Department - Amirkabir University of Technology - Tehran, Iran , Momtazi, S Computer Engineering and Information Technology Department - Amirkabir University of Technology - Tehran, Iran
Abstract :
Text clustering and classification are two main tasks in text mining. Feature selection plays a key role in the
quality of clustering and classification results. Although word-based features such as Term Frequency-Inverse
Document Frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in
capturing the semantic concepts of text have motivated researchers to use semantic models for document vector
representation. Latent Dirichlet Allocation (LDA) topic modeling and doc2vec neural document embedding are
two well-known techniques for this purpose.
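The limitation of word-based features noted above can be illustrated with a small sketch. It is a minimal, standard-library TF-IDF written for illustration only, not the setup used in the experiments; the toy documents and the helper names `tfidf_vectors` and `cosine` are assumptions. Two documents about the same concept that share no surface words receive a cosine similarity of zero, exactly as an unrelated pair would.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus: the first two documents are semantically
# related but have no words in common.
docs = [
    ["car", "engine", "repair"],
    ["automobile", "motor", "maintenance"],
    ["election", "vote", "result"],
]
vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])    # same concept, different words
unrelated = cosine(vecs[0], vecs[2])  # different concept
```

Both `related` and `unrelated` come out as zero here because the vectors have no terms in common, which is the gap that topic models and document embeddings aim to close.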
In this work, we first studied the conceptual difference between the two models and showed that they behave
differently and capture semantic features of texts from different perspectives. We then proposed a
hybrid approach for document vector representation to benefit from the advantages of both models. The
experimental results on the 20 Newsgroups dataset showed that the proposed model outperforms each of
the baselines on both text clustering and classification tasks. We achieved a 2.6% improvement in F-measure
for text clustering and a 2.1% improvement in F-measure for text classification compared to the best baseline
model.
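The abstract does not state how the two representations are combined. As one plausible reading, the sketch below L2-normalizes an LDA topic distribution and a doc2vec embedding separately and concatenates them, so that neither view dominates by scale. Concatenation, the helper names `l2_normalize` and `hybrid_vector`, and the numeric values are all assumptions made for illustration, not the authors' confirmed method.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def hybrid_vector(topic_dist, embedding):
    """Assumed combination: concatenate the two normalized views of a document."""
    return l2_normalize(topic_dist) + l2_normalize(embedding)

# Hypothetical inputs for a single document.
lda_topics = [0.7, 0.2, 0.1]            # topic distribution from an LDA model
doc2vec_emb = [0.5, -1.5, 2.0, 0.25]    # dense vector from a doc2vec model

hybrid = hybrid_vector(lda_topics, doc2vec_emb)
```

The resulting vector has one dimension per topic plus one per embedding dimension and can be passed to any clustering or classification algorithm that accepts dense feature vectors.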
Keywords :
Neural Document Embedding, Topic Modeling, Semantic Representation, Text Mining