DocumentCode :
1019221
Title :
SVMs Modeling for Highly Imbalanced Classification
Author :
Tang, Yuchun ; Zhang, Yan-Qing ; Chawla, Nitesh V. ; Krasser, Sven
Author_Institution :
McAfee Inc., Alpharetta, GA
Volume :
39
Issue :
1
Year :
2009
Firstpage :
281
Lastpage :
288
Abstract :
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, oversampling, and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using various metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective because it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient because it extracts far fewer support vectors and, hence, greatly speeds up SVM prediction.
Keywords :
pattern classification; support vector machines (SVMs); SVM modeling; cost-sensitive learning; highly imbalanced classification algorithms; Computational intelligence; granular computing; highly imbalanced classification; oversampling; undersampling; Algorithms; Area Under Curve; Artificial Intelligence; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Pattern Recognition, Automated; ROC Curve
Language :
English
Journal_Title :
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
Publisher :
IEEE
ISSN :
1083-4419
Type :
jour
DOI :
10.1109/TSMCB.2008.2002909
Filename :
4695979