• DocumentCode
    2477341
  • Title
    An empirical study to address the problem of Unbalanced Data Sets in sentiment classification
  • Author
    Mountassir, Asmaa ; Benbrahim, Houda ; Berrada, Ilham
  • Author_Institution
    ALBIRONI Res. Team, Mohamed 5 Univ., Rabat, Morocco
  • fYear
    2012
  • fDate
    14-17 Oct. 2012
  • Firstpage
    3298
  • Lastpage
    3303
  • Abstract
    With the emergence of Web 2.0, Sentiment Analysis is receiving increasing attention. Several interesting works have addressed different issues in Sentiment Analysis. Nevertheless, the problem of Unbalanced Data Sets has not been sufficiently addressed within this research area. This paper presents a study of unbalanced data sets in supervised sentiment classification in a multi-lingual context. We propose three different methods to under-sample the majority class documents: Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of the proposed methods with that of common random under-sampling. We also aim to evaluate the behavior of the classifiers under different under-sampling rates. We use three common classifiers, namely Naïve Bayes, Support Vector Machines and k-Nearest Neighbors. The experiments are carried out on two Arabic data sets and one English data set. We show that the four under-sampling methods are generally competitive with one another. Naïve Bayes is shown to be insensitive to unbalanced data sets. Support Vector Machines, in contrast, appear to be highly sensitive to unbalanced data sets, while k-Nearest Neighbors shows only a slight sensitivity to imbalance in comparison with Support Vector Machines.
  • Keywords
    Bayes methods; Internet; data mining; natural language processing; pattern classification; pattern clustering; support vector machines; text analysis; Arabic data set; English data set; Remove Farthest; Remove Similar method; Remove by Clustering method; Web 2.0; k-nearest neighbors; majority class documents; multilingual context; naive Bayes; opinion mining; random undersampling; sentiment analysis; supervised sentiment classification; unbalanced data set; undersampling rate; Accuracy; Labeling; Sampling methods; Training; Machine Learning; Text Classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on
  • Conference_Location
    Seoul
  • Print_ISBN
    978-1-4673-1713-9
  • Electronic_ISBN
    978-1-4673-1712-2
  • Type
    conf
  • DOI
    10.1109/ICSMC.2012.6378300
  • Filename
    6378300
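
The abstract names random under-sampling as the baseline against which the three proposed methods are compared. The sketch below illustrates that generic baseline only; it is not taken from the paper. The function name `random_undersample`, the binary-class restriction, and the interpretation of `rate` as the fraction of surplus majority-class documents to remove (so that `rate=1.0` balances the two classes) are all assumptions made for illustration, and the paper's own definition of the under-sampling rate may differ.

```python
import random
from collections import Counter


def random_undersample(docs, labels, rate=1.0, seed=0):
    """Randomly drop majority-class documents from a binary data set.

    ASSUMPTION (not from the paper): `rate` is the fraction of the surplus
    (majority count minus minority count) that is removed, so rate=1.0
    yields balanced classes and rate=0.0 leaves the data unchanged.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")

    # most_common sorts by count, so the first entry is the majority class.
    (majority, n_major), (_, n_minor) = counts.most_common(2)
    n_remove = round(rate * (n_major - n_minor))

    # Pick the majority-class documents to drop, uniformly at random.
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    rng = random.Random(seed)
    dropped = set(rng.sample(majority_idx, n_remove))

    # Keep everything else, preserving the original order.
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [docs[i] for i in kept], [labels[i] for i in kept]
```

Calling the function with several values of `rate` on the same training set, then training each classifier on the result, would be one way to reproduce the kind of rate-versus-performance comparison the abstract describes.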