DocumentCode :
607280
Title :
Automatic Arabic Text Categorization Using Bayesian Learning
Author :
Kadhim, M.H. ; Omar, Normaliza
Author_Institution :
Sch. of Comput. Sci., Univ. Kebangsaan Malaysia (UKM), Bangi, Malaysia
fYear :
2012
fDate :
3-5 Dec. 2012
Firstpage :
415
Lastpage :
419
Abstract :
Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of the Arabic language and its large vocabulary size make these techniques difficult to apply and costly in time and effort. We investigated Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian learning classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). Text is represented as word-level N-grams: 1-grams, 2-grams, and 3-grams. For Arabic stemming, a simple stemmer called the TREC-2002 Light Stemmer is used in the prototype. For feature selection, we used several techniques: Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results indicate that using word-level n-grams for ATC based on Bayesian learning leads to acceptable results.
Keywords :
Bayes methods; learning (artificial intelligence); text analysis; 1-gram; 2-gram; 3-gram; ATC; Arabic stemming; Bayesian learning classifiers; CHI; Chi-Square statistic; FB; GSS coefficient; MBNB; MGNB; MI; ML techniques; OR; TREC-2002 light stemmer; automatic Arabic text categorization; complex morphology; electronic document categorization; feature selection techniques; flexible Bayes; multinomial Naive Bayes; multivariate Bernoulli Naive Bayes; multivariate Gauss Naive Bayes; mutual information; odds ratio; supervised machine learning techniques; text representation; word level n-gram; Arabic text categorization; Bayesian learning; Feature Selection (FS); automatic text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computing and Convergence Technology (ICCCT), 2012 7th International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4673-0894-6
Type :
conf
Filename :
6530369