Title :
Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?
Author :
Xue, Jing-Hao ; Hall, Peter
Author_Institution :
Dept. of Stat. Sci., Univ. Coll. London, London, UK
Abstract :
Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
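The relationship described in the abstract can be illustrated with a small numerical sketch. It is not the authors' derivation. It assumes, for illustration only, two Gaussian classes in two dimensions with different covariance matrices, and it assumes that as the training set grows the pooled LDA covariance approaches the class-proportion-weighted average of the two class covariances. All parameter values and the names `lda_population_auc`, `MU0`, `MU1`, `SIGMA0`, and `SIGMA1` are hypothetical. Under these assumptions the population AUC of a linear score has a closed form, so no random sampling is needed.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def solve_2x2(a, b):
    """Solve a @ w = b for a 2x2 matrix a (nested lists) by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (a[1][1] * b[0] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def quad_form(w, s):
    """Return w' s w for a 2-vector w and a 2x2 matrix s."""
    return sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))


def lda_population_auc(pi_minority, mu0, mu1, sigma0, sigma1):
    """Population AUC of the large-sample LDA direction.

    pi_minority is the minority-class fraction in the TRAINING data.
    Assumption: the pooled covariance tends to
    (1 - pi) * sigma0 + pi * sigma1 as the training set grows.
    """
    pooled = [
        [(1.0 - pi_minority) * sigma0[i][j] + pi_minority * sigma1[i][j]
         for j in range(2)]
        for i in range(2)
    ]
    delta = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = solve_2x2(pooled, delta)  # LDA direction: pooled^{-1} (mu1 - mu0)
    # The score difference between a random minority point and a random
    # majority point is Gaussian with mean w'delta and variance
    # w'(sigma0 + sigma1)w, so AUC = Phi(mean / standard deviation).
    mean_gap = w[0] * delta[0] + w[1] * delta[1]
    spread = math.sqrt(quad_form(w, sigma0) + quad_form(w, sigma1))
    return normal_cdf(mean_gap / spread)


# Hypothetical class parameters (not taken from the paper).
MU0, MU1 = [0.0, 0.0], [1.0, 1.0]
SIGMA0 = [[1.0, 0.0], [0.0, 4.0]]  # majority-class covariance
SIGMA1 = [[4.0, 0.0], [0.0, 1.0]]  # minority-class covariance

auc_unbalanced = lda_population_auc(0.05, MU0, MU1, SIGMA0, SIGMA1)
auc_balanced = lda_population_auc(0.50, MU0, MU1, SIGMA0, SIGMA1)
print(f"AUC, 5% minority in training: {auc_unbalanced:.4f}")
print(f"AUC, 50% minority in training: {auc_balanced:.4f}")
```

The test-set class proportions do not enter the calculation, because the AUC compares one randomly drawn minority point with one randomly drawn majority point. Only the training proportion changes the direction, and in this sketch the balanced proportion gives the larger AUC.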
Keywords :
Gaussian processes; pattern classification; sampling methods; AUC; Gaussian-based linear discriminant analysis; LDA; class sizes; class-unbalanced data rebalancing; classification performance; classifiers; majority class undersampling; minority class oversampling; quantitative measure; rebalanced training data; receiver operating characteristic curve; training dataset; unbalanced test data; Covariance matrices; Data mining; Educational institutions; Linear discriminant analysis; Training; Training data; Vectors; ROC; class imbalance; class rebalancing; oversampling; undersampling
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
DOI :
10.1109/TPAMI.2014.2359660