Title :
A scalable method for instance selection for class-imbalance datasets
Author :
De Haro-García, Aida ; García-Pedrajas, Nicolás
Author_Institution :
Dept. of Comput. & Numerical Anal., Univ. of Cordoba, Cordoba, Spain
Abstract :
Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. Research areas such as bioinformatics, text mining and intrusion detection are generating huge amounts of information that must be dealt with. Instance selection is a powerful tool for reducing that information to manageable datasets. Most of the datasets in these areas share a common property: they are heavily class-imbalanced. The class of interest, also called the positive or minority class, is outnumbered many times by the negative, or majority, class. Thus, any instance selection algorithm addressing these problems must take into account two important features of such problems: first, the large size of the datasets, which makes scalability very relevant; second, the class-imbalanced distribution of the instances. In this paper, we propose a new methodology for instance selection that is specifically designed for large class-imbalanced datasets. We use a divide-and-conquer approach to deal with the scalability of the algorithms, and a combination of different rounds of instance selection to improve the results in terms of class-imbalance error measures. The validity of the proposed framework is assessed using 45 datasets. Our proposal improves on the results of standard methods in accuracy and storage reduction, and at the same time reduces the time needed by the algorithms, with a time complexity of O(n log(n)).
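The divide-and-conquer idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how the rounds are combined or which base selector is used, so the vote threshold (`min_votes`), the placeholder selector `base_select`, and every function and parameter name here are assumptions made for illustration. Each round splits the data into disjoint random subsets, runs the selector on each subset independently, and an instance is kept if enough rounds voted for it.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def base_select(X, y, idxs, minority=1):
    """Hypothetical placeholder selector (an assumption, not from the paper).

    Keeps every minority instance, and keeps a majority instance only if its
    nearest neighbour inside the subset carries a different label.
    """
    kept = []
    for i in idxs:
        if y[i] == minority:
            kept.append(i)
            continue
        others = [j for j in idxs if j != i]
        if not others:
            continue
        nearest = min(others, key=lambda j: sq_dist(X[i], X[j]))
        if y[nearest] != y[i]:
            kept.append(i)
    return kept


def partition(n, subset_size, rng):
    """Split indices 0..n-1 into disjoint random subsets of bounded size."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[k:k + subset_size] for k in range(0, n, subset_size)]


def select_instances(X, y, subset_size=50, rounds=5, min_votes=3, seed=0):
    """Run several partition-and-select rounds and combine them by voting.

    The voting rule is an assumed way of combining rounds.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        for subset in partition(len(X), subset_size, rng):
            votes.update(base_select(X, y, subset))
    return sorted(i for i, v in votes.items() if v >= min_votes)


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Synthetic imbalanced data: 180 majority points, 20 minority points.
    X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(180)]
    X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(20)]
    y = [0] * 180 + [1] * 20
    kept = select_instances(X, y)
    print(len(kept), "of", len(X), "instances kept")
```

Because the placeholder selector only ever compares instances within one subset, its quadratic cost is bounded by `subset_size`, so the work per round in this sketch grows linearly with the number of instances when `subset_size` is fixed.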
Keywords :
computational complexity; data mining; divide and conquer methods; storage management; very large databases; bioinformatics; class-imbalance datasets; class-imbalance error measures; class-imbalanced distribution; divide-and-conquer approach; heavily class-imbalanced; information reduction; instance selection algorithm; intrusion detection; manageable datasets; minority class; positive class; scalable method; storage reduction; text mining; time complexity; Accuracy; Algorithm design and analysis; Complexity theory; Partitioning algorithms; Proposals; Scalability; Training; Class-imbalanced problems; Data mining; Instance selection; Scaling up;
Conference_Title :
Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on
Conference_Location :
Cordoba
Print_ISBN :
978-1-4577-1676-8
DOI :
10.1109/ISDA.2011.6121853