• DocumentCode
    8010
  • Title

    RACOG and wRACOG: Two Probabilistic Oversampling Techniques

  • Author

    Das, Barnan ; Krishnan, Narayanan C. ; Cook, Diane J.

  • Author_Institution
    Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA, USA
  • Volume
    27
  • Issue
    1
  • fYear
    2015
  • fDate
    Jan. 1 2015
  • Firstpage
    222
  • Lastpage
    234
  • Abstract
    As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well, which leads to high misclassification error. We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, for synthetically generating and strategically selecting new minority class samples. The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.
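    The lag-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gibbs_oversample`, its parameters (`burn_in`, `lag`, `alpha`), and the use of a Laplace-smoothed empirical joint distribution over discrete attributes are all assumptions made for illustration, since the abstract does not say how the joint distribution is estimated. Only the RACOG-style rule (keep every `lag`-th state of the chain) is shown; the wRACOG rule would additionally need a trained classifier and is omitted.

```python
import random
from collections import Counter


def gibbs_oversample(minority, n_new, burn_in=100, lag=20, alpha=1.0, seed=0):
    """Draw n_new synthetic rows from discrete minority-class rows.

    Assumption (not from the source): the joint distribution is the
    empirical count of each full row plus Laplace smoothing `alpha`.
    """
    rng = random.Random(seed)
    n_attrs = len(minority[0])
    # The values observed for each attribute define that attribute's domain.
    domains = [sorted({row[i] for row in minority}) for i in range(n_attrs)]
    counts = Counter(tuple(row) for row in minority)

    # Start the Markov chain at a real minority sample.
    state = list(rng.choice(minority))
    samples = []
    sweeps = 0
    while len(samples) < n_new:
        # One Gibbs sweep: resample each attribute given all the others.
        for i in range(n_attrs):
            weights = []
            for value in domains[i]:
                candidate = tuple(state[:i] + [value] + state[i + 1:])
                # Conditional is proportional to the smoothed joint.
                weights.append(counts[candidate] + alpha)
            state[i] = rng.choices(domains[i], weights=weights, k=1)[0]
        sweeps += 1
        # Discard the burn-in sweeps, then keep every lag-th state.
        if sweeps > burn_in and (sweeps - burn_in) % lag == 0:
            samples.append(tuple(state))
    return samples


if __name__ == "__main__":
    minority_rows = [(0, 1, 1), (0, 1, 0), (1, 1, 1), (0, 0, 1)]
    for row in gibbs_oversample(minority_rows, n_new=5):
        print(row)
```

    Spacing the retained states by a lag reduces the correlation between consecutive samples of the chain, which is the usual motivation for thinning a Gibbs sampler. The full-row count table used here only scales to a handful of attributes.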
  • Keywords
    Markov processes; Monte Carlo methods; learning (artificial intelligence); pattern classification; statistical distributions; Gibbs sampler; Gibbs sampling; UCI data sets; class labels; classification performance; data attributes joint probability distribution; imbalanced class distribution problem; machine learning techniques; minority class probability distribution; probabilistic oversampling techniques; wRACOG; Approximation algorithms; Approximation methods; Joints; Kernel; Machine learning algorithms; Probabilistic logic; Probability distribution; Imbalanced class distribution; approximating joint probability distribution; probabilistic oversampling
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2014.2324567
  • Filename
    6816044