• DocumentCode
    607280
  • Title

    Automatic Arabic Text Categorization using Bayesian learning

  • Author

    Kadhim, M.H. ; Omar, Normaliza

  • Author_Institution
    Sch. of Comput. Sci., Univ. Kebangsaan Malaysia (UKM), Bangi, Malaysia
  • fYear
    2012
  • fDate
    3-5 Dec. 2012
  • Firstpage
    415
  • Lastpage
    419
  • Abstract
    Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of the Arabic language and its large vocabulary make these techniques difficult to apply and costly in time and effort. We investigated Bayesian learning, which is based on Bayes' theorem, for the Arabic ATC problem. The Bayesian learning classifiers applied are Multivariate Gauss Naïve Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naïve Bayes (MBNB), and Multinomial Naïve Bayes (MNB). For text representation, word-level n-grams (1-gram, 2-gram, and 3-gram) were used. For Arabic stemming, a simple stemmer called the TREC-2002 Light Stemmer is used in the prototype. For feature selection, we used several techniques: Chi-Square Statistic (CHI), Odds Ratio (OR), Mutual Information (MI), and GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results show that using word-level n-grams for ATC based on Bayesian learning leads to acceptable results.
  • Keywords
    Bayes methods; learning (artificial intelligence); text analysis; 1-gram; 2-gram; 3-gram; ATC; Arabic stemming; Bayesian learning classifiers; CHI; Chi-Square statistic; FB; GSS coefficient; MBNB; MGNB; MI; ML techniques; OR; TREC-2002 light stemmer; automatic Arabic text categorization; complex morphology; electronic document categorization; feature selection techniques; flexible Bayes; multinomial Naive Bayes; multivariate Bernoulli Naive Bayes; multivariate Gauss Naive Bayes; mutual information; odds ratio; supervised machine learning techniques; text representation; word level n-gram; Arabic text categorization; Bayesian learning; Feature Selection (FS); automatic text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computing and Convergence Technology (ICCCT), 2012 7th International Conference on
  • Conference_Location
    Seoul
  • Print_ISBN
    978-1-4673-0894-6
  • Type

    conf

  • Filename
    6530369