DocumentCode
80071
Title
Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?
Author
Xue, Jing-Hao ; Hall, Peter
Author_Institution
Dept. of Stat. Sci., Univ. Coll. London, London, UK
Volume
37
Issue
5
fYear
2015
fDate
May 1 2015
Firstpage
1109
Lastpage
1112
Abstract
Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, by oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
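The relationship described in the abstract can be illustrated with a small population-level sketch. It is not the authors' derivation; it assumes two Gaussian classes with unequal covariance matrices, an LDA direction built from a pooled covariance weighted by the training class proportions, and an infinite-sample idealization in which rebalancing simply moves the minority proportion toward 0.5. The function name `lda_auc` and all numeric values are hypothetical.

```python
import math
import numpy as np

def lda_auc(pi_minority, mu0, mu1, cov0, cov1):
    """Population AUC of the LDA score w^T x (illustrative assumption:
    the pooled covariance is weighted by the training class proportions)."""
    delta = mu1 - mu0
    pooled = (1.0 - pi_minority) * cov0 + pi_minority * cov1
    w = np.linalg.solve(pooled, delta)  # LDA direction: pooled^{-1} (mu1 - mu0)
    # The score difference between a minority point and a majority point is
    # Gaussian with mean w^T delta and variance w^T (cov0 + cov1) w.
    z = (w @ delta) / math.sqrt(w @ (cov0 + cov1) @ w)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

# Hypothetical class parameters: majority class 0, minority class 1.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([1.0, 1.0])
cov0 = np.array([[1.0, 0.0], [0.0, 4.0]])
cov1 = np.array([[4.0, 0.0], [0.0, 1.0]])

# Minority proportion in the training data: unbalanced -> fully rebalanced.
aucs = {pi: lda_auc(pi, mu0, mu1, cov0, cov1) for pi in (0.05, 0.25, 0.5)}
for pi, auc in aucs.items():
    print(f"minority proportion {pi:.2f}: AUC = {auc:.4f}")
```

Under these assumptions the AUC rises as the minority proportion approaches 0.5. The AUC does not depend on the classification threshold, so in this sketch the training proportions matter only through the pooled covariance; when the two class covariances are equal, the proportion has no effect on the population AUC.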
Keywords
Gaussian processes; pattern classification; sampling methods; AUC; Gaussian-based linear discriminant analysis; LDA; class sizes; class-unbalanced data rebalancing; classification performance; classifiers; majority class undersampling; minority class oversampling; quantitative measure; rebalanced training data; receiver operating characteristic curve; training dataset; unbalanced test data; Covariance matrices; Data mining; Educational institutions; Linear discriminant analysis; Training; Training data; Vectors; AUC; ROC; class imbalance; class rebalancing; linear discriminant analysis; oversampling; undersampling;
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/TPAMI.2014.2359660
Filename
6906278