An empirical comparison of repetitive undersampling techniques

Author

Van Hulse, Jason ; Khoshgoftaar, Taghi M. ; Napolitano, Amri

Author_Institution

Dept. of Comput. Sci. & Eng., Florida Atlantic Univ., Boca Raton, FL, USA

fYear

2009

fDate

10-12 Aug. 2009

Firstpage

29

Lastpage

34

Abstract

A common problem for data mining and machine learning practitioners is class imbalance. When examples of one class greatly outnumber examples of the other class(es), traditional machine learning algorithms can perform poorly. Random undersampling is a technique that has shown great potential for alleviating the problem of class imbalance. However, undersampling discards training examples, and the resulting information loss can hinder classification performance in some cases. To overcome this problem, repetitive undersampling techniques have been proposed. These techniques generate an ensemble of models, each trained on a different undersampled subset of the training data. In doing so, less information is lost and classification performance is improved. In this study, we evaluate the performance of several repetitive undersampling techniques. To our knowledge, no previous study has compared repetitive undersampling techniques this thoroughly.
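
The general idea described in the abstract (an ensemble in which each model is trained on a different randomly undersampled subset) can be illustrated with a minimal sketch. This is not the authors' implementation and does not correspond to any specific technique evaluated in the paper; the nearest-centroid base learner, the majority-vote combination rule, and all function names (`undersample`, `fit_ensemble`, `predict_ensemble`, and so on) are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def undersample(X, y, rng):
    """Randomly keep, for every class, as many examples as the rarest class has."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    n_min = min(len(rows) for rows in by_class.values())
    Xs, ys = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, n_min):
            Xs.append(features)
            ys.append(label)
    return Xs, ys


def fit_centroids(X, y):
    """Toy base learner (an assumption): the mean feature vector of each class."""
    counts = Counter(y)
    sums = {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroids(model, features):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def fit_ensemble(X, y, n_models=10, seed=0):
    """Train n_models base learners, each on its own undersampled subset."""
    rng = random.Random(seed)
    return [fit_centroids(*undersample(X, y, rng)) for _ in range(n_models)]


def predict_ensemble(models, features):
    """Combine the base learners by majority vote (an assumed combination rule)."""
    votes = Counter(predict_centroids(m, features) for m in models)
    return votes.most_common(1)[0][0]
```

Because every subset draws a different random sample of the majority class, the ensemble as a whole sees more of the majority-class data than a single undersampled model would, which is the motivation the abstract gives for repeating the undersampling step.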

Keywords

data mining; learning (artificial intelligence); pattern classification; class imbalance; empirical comparison; hinder classification; machine learning; repetitive undersampling technique; Application software; Computer science; Data engineering; Machine learning algorithms; Medical diagnosis; Performance loss; Sampling methods; Training data

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Reuse & Integration, 2009. IRI '09. IEEE International Conference on

Conference_Location

Las Vegas, NV

Print_ISBN

978-1-4244-4114-3

Electronic_ISBN

978-1-4244-4116-7

Type

conf

DOI

10.1109/IRI.2009.5211614

Filename

5211614