• DocumentCode
    3673630
  • Title
    Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data
  • Author
    Joseph Prusa;Taghi M. Khoshgoftaar;David J. Dittman;Amri Napolitano
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2015
  • Firstpage
    197
  • Lastpage
    202
  • Abstract
    Sentiment classification of tweets is used for a variety of social sensing tasks and provides a means of discerning public opinion on a wide range of topics. A potential concern when performing sentiment classification is that the training data may contain class imbalance, which can negatively affect classification performance. A classifier trained on imbalanced data may be biased in favor of the majority class. One possible method of addressing this is to use data sampling to achieve a more balanced class distribution. In this work, we seek to observe how data sampling (using random undersampling with either a 50:50 or 35:65 positive:negative post-sampling class distribution ratio) affects classification performance on tweet sentiment data. Our experimental results show that random undersampling significantly improves classification performance in comparison to not using any data sampling. Furthermore, there is no significant difference between selecting a 50:50 or a 35:65 post-sampling class distribution ratio.
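
The random undersampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample`, its parameters, and the toy labels are hypothetical, and it assumes a two-class problem in which every minority-class instance is kept while majority-class instances are discarded at random until the minority class reaches the requested share (for example 0.5 for a 50:50 ratio or 0.35 for 35:65). Which sentiment class is the minority in the paper's data is not stated in this record.

```python
import random
from collections import Counter


def random_undersample(labels, minority_fraction=0.5, seed=0):
    """Return sorted indices of a randomly undersampled subset.

    All minority-class instances are kept; majority-class instances are
    dropped at random so that the minority class makes up approximately
    `minority_fraction` of the result. (Hypothetical helper for
    illustration only.)
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two classes")

    # most_common orders classes from largest to smallest.
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Majority instances needed so that
    # n_minor / (n_minor + kept) is about minority_fraction.
    target = round(n_minor * (1 - minority_fraction) / minority_fraction)
    n_keep = min(n_major, target)  # undersampling never adds instances

    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    return sorted(minor_idx + rng.sample(major_idx, n_keep))


# Toy data (made up): 35 minority and 200 majority labels.
toy_labels = ["pos"] * 35 + ["neg"] * 200
idx_35_65 = random_undersample(toy_labels, minority_fraction=0.35, seed=1)
idx_50_50 = random_undersample(toy_labels, minority_fraction=0.50, seed=1)
print(Counter(toy_labels[i] for i in idx_35_65))
print(Counter(toy_labels[i] for i in idx_50_50))
```

The returned indices would be used to select the training instances only; under the usual practice, the evaluation data keeps its original class distribution.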
  • Keywords
    "Feature extraction","Support vector machines","Training","Data mining","Training data","Analysis of variance","Machine learning algorithms"
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration (IRI), 2015 IEEE International Conference on
  • Type
    conf
  • DOI
    10.1109/IRI.2015.39
  • Filename
    7300975