DocumentCode :
3633642
Title :
LDA-based keyword selection in text categorization
Author :
Serafettin Tasci;Tunga Gungor
Author_Institution :
Comput. Eng. Dept., Bogazici Univ., Istanbul, Turkey
fYear :
2009
Firstpage :
230
Lastpage :
235
Abstract :
Text categorization is the task of automatically assigning unlabeled text documents to predefined categories by means of an induction algorithm. Since the data in text categorization are high-dimensional, feature selection is widely used in text categorization systems to reduce the dimensionality. The literature offers several well-known feature selection metrics, such as information gain and document frequency thresholding. Recently, Latent Dirichlet Allocation (LDA), a generative graphical model that can be used to model and discover the underlying topic structure of textual data, was proposed. In this paper, we use the hidden topic analysis of LDA for feature selection and compare it with the classical feature selection metrics in text categorization. For the experiments, we use SVM as the classifier and tf*idf for weighting the terms. We observed that, in almost all metrics, information gain performs best at all keyword numbers, while the LDA-based metrics perform similarly to chi-square and document frequency thresholding.
Keywords :
"Text categorization","Support vector machines","Linear discriminant analysis","Support vector machine classification","Frequency","Graphical models","Induction generators","Performance gain","Statistics","Classification algorithms"
Publisher :
IEEE
Conference_Title :
Computer and Information Sciences, 2009. ISCIS 2009. 24th International Symposium on
Print_ISBN :
978-1-4244-5021-3
Type :
conf
DOI :
10.1109/ISCIS.2009.5291818
Filename :
5291818