Regional Information Center for Science and Technology (RICeST) - LDA-based keyword selection in text categorization

DocumentCode :

3633642

Title :

LDA-based keyword selection in text categorization

Author :

Serafettin Tasci;Tunga Gungor

Author_Institution :

Comput. Eng. Dept., Bogazici Univ., Istanbul, Turkey

Year :

2009

Firstpage :

230

Lastpage :

235

Abstract :

Text categorization is the task of automatically assigning unlabeled text documents to predefined category labels by means of an induction algorithm. Because the data in text categorization are high-dimensional, feature selection is widely used in text categorization systems to reduce the dimensionality. The literature offers several well-known metrics, such as information gain and document frequency thresholding. Recently, latent Dirichlet allocation (LDA), a generative graphical model that can be used to model and discover the underlying topic structure of textual data, was proposed. In this paper, we use the hidden topic analysis of LDA for feature selection and compare it with the classical feature selection metrics in text categorization. For the experiments, we use an SVM as the classifier and tf*idf for weighting the terms. We observed that, on almost all metrics, information gain performs best at all keyword numbers, while the LDA-based metrics perform similarly to chi-square and document frequency thresholding.
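
As an illustration of the classical baseline named in the abstract, the following is a minimal sketch of information-gain keyword selection followed by tf*idf weighting of the selected terms. It is not the authors' implementation: the toy corpus, the labels, and the function names (entropy, information_gain, select_keywords, tfidf_vector) are all hypothetical, the idf form log(N/df) is one common variant chosen here as an assumption, and the LDA-based selection and SVM training steps are omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus (tokenized documents) and labels, for illustration only.
DOCS = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
LABELS = ["sport", "sport", "finance", "finance"]

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)."""
    present = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    absent = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n = len(docs)
    conditional = 0.0
    for split in (present, absent):
        size = sum(split.values())
        if size:  # skip an empty split to avoid dividing by zero
            conditional += (size / n) * entropy(list(split.values()))
    return entropy(list(Counter(labels).values())) - conditional

def select_keywords(docs, labels, k):
    """Return the k terms with the highest information gain.

    The vocabulary is sorted first and sorted() is stable, so ties are
    broken alphabetically.
    """
    vocabulary = sorted({term for doc in docs for term in doc})
    ranked = sorted(vocabulary, key=lambda t: -information_gain(docs, labels, t))
    return ranked[:k]

def tfidf_vector(doc, docs, keywords):
    """Weight each selected keyword by raw term frequency times log(N / df)."""
    n = len(docs)
    vector = []
    for term in keywords:
        df = sum(1 for d in docs if term in d)
        idf = math.log(n / df) if df else 0.0
        vector.append(doc.count(term) * idf)
    return vector

keywords = select_keywords(DOCS, LABELS, 4)
print(keywords)
print(tfidf_vector(DOCS[0], DOCS, keywords))
```

In a full pipeline of the kind the abstract describes, the resulting tf*idf vectors would be passed to an SVM classifier, and the ranking function would be swapped for each feature selection metric under comparison.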

Keywords :

"Text categorization","Support vector machines","Linear discriminant analysis","Support vector machine classification","Frequency","Graphical models","Induction generators","Performance gain","Statistics","Classification algorithms"

Publisher :

IEEE

Conference_Title :

Computer and Information Sciences, 2009. ISCIS 2009. 24th International Symposium on

Print_ISBN :

978-1-4244-5021-3

Type :

conf

DOI :

10.1109/ISCIS.2009.5291818

Filename :

5291818

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3633642