Regional Information Center for Science and Technology (RICeST) - Some methods to address the problem of unbalanced sentiment classification in an Arabic context

DocumentCode :

2622581

Title :

Some methods to address the problem of unbalanced sentiment classification in an Arabic context

Author :

Mountassir, Asmaa ; Benbrahim, Houda ; Berrada, Ilham

Author_Institution :

ALBIRONI Res. Team, Mohamed 5 Univ., Rabat, Morocco

fYear :

2012

fDate :

22-24 Oct. 2012

Firstpage :

Lastpage :

Abstract :

The rise of social media (such as online web forums and social networking sites) has attracted interest in mining and analyzing opinions available on the web. Online opinion has become the object of study in many research areas, especially the one called “Opinion Mining and Sentiment Analysis”. Several interesting and advanced works have been carried out on a few languages (in particular English). However, there have been very few studies on some languages such as Arabic. This paper presents the study we have carried out to address the problem of unbalanced data sets in supervised sentiment classification in an Arabic context. We propose three different methods to under-sample the majority class documents. Our goal is to compare the effectiveness of the proposed methods with the common random under-sampling. We also aim to evaluate the behavior of the classifier toward different under-sampling rates. We use two common classifiers, namely Naïve Bayes and Support Vector Machines. The experiments are carried out on an Arabic data set that we built from Aljazeera's web site and labeled manually. The results show that Naïve Bayes is sensitive to data set size: the more we reduce the data, the more the results degrade. However, it is not sensitive to unbalanced data sets, unlike Support Vector Machines, which are highly sensitive to unbalanced data sets. The results also show that we can rely on the proposed techniques and that they are typically competitive with random under-sampling.
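
The random under-sampling baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code, and the three proposed under-sampling methods are not described in this record, so only the random baseline is shown. The function name `random_under_sample`, the toy data, and the reading of "rate" as the fraction of majority-class documents removed are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(docs, labels, rate, seed=0):
    """Remove a random fraction `rate` of the majority-class documents.

    Hypothetical helper: `rate` is assumed to mean the share of
    majority-class documents to discard (0.0 keeps all, 1.0 drops all).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")

    # Identify the majority class and the positions of its documents.
    majority = Counter(labels).most_common(1)[0][0]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]

    # Pick which majority documents to drop, reproducibly via the seed.
    n_drop = int(len(majority_idx) * rate)
    dropped = set(random.Random(seed).sample(majority_idx, n_drop))

    # Keep everything else, preserving the original order.
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [docs[i] for i in kept], [labels[i] for i in kept]


# Toy, made-up data: 8 majority ("neg") and 2 minority ("pos") documents.
docs = [f"doc{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

sampled_docs, sampled_labels = random_under_sample(docs, labels, rate=0.5)
print(Counter(sampled_labels))
```

Repeating the call over a range of `rate` values and training a classifier on each reduced set is one way to study sensitivity to the under-sampling rate, as the abstract describes; the classifiers themselves are left out of this sketch.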

Keywords :

data mining; natural language processing; pattern classification; random processes; sampling methods; social networking (online); support vector machines; text analysis; Aljazeera Web site; Arabic data set size; Arabic languages; English languages; Naive-Bayes classifier; data labelling; document handling; online Web forums; online opinion analysis; online opinion mining; opinion mining-and-sentiment analysis; random under-sampling rates; social media; social networking sites; supervised sentiment classification; support vector machine classifier; unbalanced data sets; Accuracy; Classification algorithms; Clustering algorithms; Niobium; Radio frequency; Support vector machines; Training; Arabic Language; Corpora; Machine Learning; Natural Language Processing; Opinion Mining; Sentiment Analysis; Text Classification; Unbalanced Data sets;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Science and Technology (CIST), 2012 Colloquium in

Conference_Location :

Fez

Print_ISBN :

978-1-4673-2726-8

Electronic_ISBN :

978-1-4673-2724-4

Type :

conf

DOI :

10.1109/CIST.2012.6388061

Filename :

6388061

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2622581