A scalable method for instance selection for class-imbalance datasets

Author

De Haro-García, Aida ; García-Pedrajas, Nicolás

Author_Institution

Dept. of Comput. & Numerical Anal., Univ. of Cordoba, Cordoba, Spain

fYear

2011

fDate

22-24 Nov. 2011

Firstpage

1383

Lastpage

1390

Abstract

Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. Research areas such as bioinformatics, text mining and intrusion detection are generating huge amounts of information that must be dealt with. Instance selection is a powerful tool for reducing that information to manageable datasets. Most of the datasets in these areas share a common property: they are heavily class-imbalanced. The class of interest (the positive, or minority, class) is outnumbered many times over by the negative, or majority, class. Thus, any instance selection algorithm addressing these problems must take into account two important features. The first is the large size of the datasets, which makes scalability highly relevant. The second is the class-imbalanced distribution of the instances. In this paper, we propose a new methodology for instance selection that is specifically designed for large class-imbalanced datasets. We use a divide-and-conquer approach to deal with the scalability of the algorithms, and a combination of different rounds of instance selection to improve the results in terms of class-imbalance error measures. The validity of the proposed framework is evaluated on 45 datasets. Our proposal improves on the results of standard methods in accuracy and storage reduction, while also reducing the time needed by the algorithms, with a time complexity of O(n log(n)).
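
As an illustration only, the general idea of combining a divide-and-conquer partition with several rounds of instance selection could be sketched in Python as follows. This is not the authors' algorithm: the toy base selector, the vote threshold, the subset size, the random repartitioning, and the decision to always retain minority-class instances are all assumptions made for this sketch, and the names `nearest_same_class` and `partitioned_selection` are hypothetical.

```python
import random
from collections import Counter


def nearest_same_class(subset, X, y):
    """Toy base selector (an assumption, not from the paper): flag index i as
    removable when its nearest neighbour inside the subset has the same label."""
    removable = []
    for i in subset:
        best, best_d = None, float("inf")
        for j in subset:
            if i == j:
                continue
            # Squared Euclidean distance between instances i and j.
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            if d < best_d:
                best, best_d = j, d
        if best is not None and y[best] == y[i]:
            removable.append(i)
    return removable


def partitioned_selection(X, y, minority, subset_size=50, rounds=5,
                          threshold=3, seed=0):
    """Return the sorted indices of the instances kept after several rounds.

    In each round the data are split into disjoint random subsets of bounded
    size, the base selector runs on every subset independently, and each
    flagged majority-class instance receives one removal vote.
    """
    rng = random.Random(seed)
    votes = Counter()
    indices = list(range(len(X)))
    for _ in range(rounds):
        rng.shuffle(indices)  # a new random partition per round
        for start in range(0, len(indices), subset_size):
            subset = indices[start:start + subset_size]
            for i in nearest_same_class(subset, X, y):
                if y[i] != minority:  # assumption: never remove minority instances
                    votes[i] += 1
    # Keep every instance that collected fewer removal votes than the threshold.
    return [i for i in range(len(X)) if votes[i] < threshold]
```

Because each subset has a bounded size, the quadratic cost of the base selector is paid only within subsets, so the work per round in this sketch grows linearly with the number of instances. The O(n log(n)) bound stated in the abstract refers to the authors' method, not to this sketch.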

Keywords

computational complexity; data mining; divide and conquer methods; storage management; very large databases; bioinformatics; class-imbalance datasets; class-imbalance error measures; class-imbalanced distribution; divide-and-conquer approach; heavily class-imbalanced; information reduction; instance selection algorithm; intrusion detection; manageable datasets; minority class; positive class; scalable method; storage reduction; text mining; time complexity; Accuracy; Algorithm design and analysis; Complexity theory; Partitioning algorithms; Proposals; Scalability; Training; Class-imbalanced problems; Data mining; Instance selection; Scaling up;

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on

Conference_Location

Cordoba

ISSN

2164-7143

Print_ISBN

978-1-4577-1676-8

Type

conf

DOI

10.1109/ISDA.2011.6121853

Filename

6121853