Regional Information Center for Science and Technology - Automatic Arabic Text Categorization using Bayesian learning

DocumentCode :

607280

Title :

Automatic Arabic Text Categorization using Bayesian learning

Author :

Kadhim, M.H. ; Omar, Normaliza

Author_Institution :

Sch. of Comput. Sci., Univ. Kebangsaan Malaysia (UKM), Bangi, Malaysia

fYear :

2012

fDate :

3-5 Dec. 2012

Firstpage :

415

Lastpage :

419

Abstract :

Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of the Arabic language and its large vocabulary size make these techniques difficult to apply and costly in time and effort. We have investigated Bayesian learning, which is based on Bayes' theorem, to deal with the Arabic ATC problem. The Bayesian learning classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). For text representation, word-level N-grams (1-gram, 2-gram, and 3-gram) have been used. For Arabic stemming, a simple stemmer called the TREC-2002 Light Stemmer is used in the prototype. For feature selection, we have used several techniques, namely Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results of this work showed that using word-level n-grams for ATC based on Bayesian learning leads to acceptable results.

Keywords :

Bayes methods; learning (artificial intelligence); text analysis; 1-gram; 2-gram; 3-gram; ATC; Arabic stemming; Bayesian learning classifiers; CHI; Chi-Square statistic; FB; GSS coefficient; MBNB; MGNB; MI; ML techniques; OR; TREC-2002 light stemmer; automatic Arabic text categorization; complex morphology; electronic document categorization; feature selection techniques; flexible Bayes; multinomial Naive Bayes; multivariate Bernoulli Naive Bayes; multivariate Gauss Naive Bayes; mutual information; odds ratio; supervised machine learning techniques; text representation; word level n-gram; Arabic text categorization; Bayesian learning; Feature Selection (FS); automatic text categorization;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computing and Convergence Technology (ICCCT), 2012 7th International Conference on

Conference_Location :

Seoul

Print_ISBN :

978-1-4673-0894-6

Type :

conf

Filename :

6530369

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=607280