Title of article :
A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification
Author/Authors :
Davoudi, Saeedeh (School of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran); Mirzaei, Sayeh (School of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran)
Pages :
8
From page :
28
To page :
35
Abstract :
Natural Language Processing (NLP) is one of the promising fields of artificial intelligence. Recently, a high volume of text data has been generated through the Internet. This kind of data is a valuable source of information that can be used in various fields such as information retrieval and recommender systems. One practical task of text mining is document classification. In this paper, we focus on Persian document classification. We introduce a new feature extraction approach that combines K-means clustering and Word2Vec to acquire semantically relevant and discriminative word representations. We call the proposed approach CC-Word2Vec (Categorical Clustering-Word2Vec) and use different classification models to compare its performance with that of other techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Latent Dirichlet Allocation (LDA). The proposed method improved the accuracy of all classifiers compared with the other techniques.
Keywords :
K-Means, LDA, GB, MLP, CC-Word2Vec, Word2Vec, TF-IDF, Persian document classification
Journal title :
The CSI Journal on Computer Science and Engineering (JCSE)
Serial Year :
2020
Record number :
2704302