• DocumentCode
    80071
  • Title

    Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?

  • Author

    Xue, Jing-Hao; Hall, Peter

  • Author_Institution
    Department of Statistical Science, University College London, London, UK
  • Volume
    37
  • Issue
    5
  • Year
    2015
  • Date
    May 1, 2015
  • Firstpage
    1109
  • Lastpage
    1112
  • Abstract
    Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, by oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
  • Keywords
    Gaussian processes; pattern classification; sampling methods; AUC; Gaussian-based linear discriminant analysis; LDA; class sizes; class-unbalanced data rebalancing; classification performance; classifiers; majority class undersampling; minority class oversampling; quantitative measure; rebalanced training data; receiver operating characteristic curve; training dataset; unbalanced test data; Covariance matrices; Data mining; Educational institutions; Linear discriminant analysis; Training; Training data; Vectors; ROC; class imbalance; class rebalancing; oversampling; undersampling
  • Language
    English
  • Journal_Title
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type

    jour

  • DOI
    10.1109/TPAMI.2014.2359660
  • Filename
    6906278