DocumentCode
3673630
Title
Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data
Author
Joseph Prusa;Taghi M. Khoshgoftaar;David J. Dittman;Amri Napolitano
Author_Institution
Florida Atlantic Univ., Boca Raton, FL, USA
fYear
2015
Firstpage
197
Lastpage
202
Abstract
Sentiment classification of tweets is used for a variety of social sensing tasks and provides a means of discerning public opinion on a wide range of topics. A potential concern when performing sentiment classification is that the training data may be class-imbalanced, which can degrade classification performance: a classifier trained on imbalanced data may be biased in favor of the majority class. One possible method of addressing this is to use data sampling to achieve a more balanced class distribution. In this work, we observe how data sampling (random undersampling with either a 50:50 or 35:65 positive:negative post-sampling class distribution ratio) affects classification performance on tweet sentiment data. Our experimental results show that random undersampling significantly improves classification performance compared with using no data sampling. Furthermore, there is no significant difference between the 50:50 and 35:65 post-sampling class distribution ratios.
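The sampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample`, the `minority_fraction` parameter, the fixed seed, and the toy data are all assumptions made for illustration. It randomly discards majority-class instances until the minority class reaches the requested share of the data (0.5 for a 50:50 ratio, 0.35 for 35:65), and it treats whichever class is less frequent as the minority.

```python
import random
from collections import Counter


def random_undersample(samples, labels, minority_fraction=0.5, seed=0):
    """Randomly drop majority-class instances (binary labels only) until the
    minority class makes up roughly `minority_fraction` of the data.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")

    # most_common() orders classes by frequency: majority first, minority second.
    (majority, n_major), (minority, n_minor) = counts.most_common()

    # Majority instances to keep so that minority / total == minority_fraction.
    target_major = round(n_minor * (1 - minority_fraction) / minority_fraction)
    target_major = min(target_major, n_major)  # never sample more than exist

    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = set(rng.sample(major_idx, target_major))
    # All minority instances are retained; only the majority class is reduced.
    keep.update(i for i, y in enumerate(labels) if y == minority)

    order = sorted(keep)  # preserve the original instance order
    return [samples[i] for i in order], [labels[i] for i in order]


if __name__ == "__main__":
    # Toy data (assumed): 35 positive and 165 negative placeholder tweets.
    tweets = [f"tweet {i}" for i in range(200)]
    sentiment = ["pos"] * 35 + ["neg"] * 165
    for fraction in (0.5, 0.35):
        _, y = random_undersample(tweets, sentiment, minority_fraction=fraction)
        print(fraction, Counter(y))
```

Undersampling of this kind would normally be applied to the training data only, so that evaluation reflects the original class distribution; that is a general practice and an assumption here, not a detail stated in the abstract.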
Keywords
"Feature extraction","Support vector machines","Training","Data mining","Training data","Analysis of variance","Machine learning algorithms"
Publisher
IEEE
Conference_Titel
Information Reuse and Integration (IRI), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/IRI.2015.39
Filename
7300975