Regional Information Center for Science and Technology - Wikipedia-Based Kernels for Text Categorization

DocumentCode :

2858887

Title :

Wikipedia-Based Kernels for Text Categorization

Author :

Minier, Zsolt ; Bodó, Zalán ; Csató, Lehel

Author_Institution :

Babes-Bolyai Univ., Cluj-Napoca

fYear :

2007

fDate :

26-29 Sept. 2007

Firstpage :

157

Lastpage :

164

Abstract :

In recent years several models have been proposed for text categorization. Within this, one of the widely applied models is the vector space model (VSM), where independence between indexing terms, usually words, is assumed. Since training corpora sizes are relatively small - compared to ≈ ∞ what would be required for a realistic number of words - the generalization power of the learning algorithms is low. It is assumed that a bigger text corpus can boost the representation and hence the learning process. Based on the work of Gabrilovich and Markovitch [6], we incorporate Wikipedia articles into the system to give word distributional representation for documents. The extension with this new corpus causes dimensionality increase, therefore clustering of features is needed. We use latent semantic analysis (LSA), kernel principal component analysis (KPCA) and kernel canonical correlation analysis (KCCA) and present results for these experiments on the Reuters corpus.

Keywords :

pattern clustering; text analysis; word processing; Reuters corpus; Wikipedia articles; Wikipedia-based kernels; features clustering; indexing terms; kernel canonical correlation analysis; kernel principal component analysis; latent semantic analysis; learning algorithms; text categorization; vector space model; Computer science; Frequency; Indexing; Information retrieval; Kernel; Machine learning; Mathematics; Scientific computing; Text categorization; Wikipedia;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Symbolic and Numeric Algorithms for Scientific Computing, 2007. SYNASC. International Symposium on

Conference_Location :

Timisoara

Print_ISBN :

978-0-7695-3078-8

Type :

conf

DOI :

10.1109/SYNASC.2007.8

Filename :

4438094

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2858887