• DocumentCode
    268117
  • Title
    OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets
  • Author
    García-Pedrajas, Nicolás ; Perez-Rodríguez, Javier ; de Haro-García, Aida

  • Author_Institution
    Dept. of Comput. & Numerical Anal., Univ. of Cordoba, Cordoba, Spain
  • Volume
    43
  • Issue
    1
  • fYear
    2013
  • fDate
    Feb. 2013
  • Firstpage
    332
  • Lastpage
    346
  • Abstract
    In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
  • Keywords
    data mining; divide and conquer methods; sampling methods; OligoIS; class-imbalance problem; class-imbalanced data sets; class-imbalanced medium-sized data sets; class-imbalanced sample distribution; data mining algorithms; divide-and-conquer principle; large data set sample distribution; scalable instance selection; state-of-the-art instance selection methods; Accuracy; Approximation algorithms; Blades; Evolutionary computation; Proposals; Scalability; Training; Class-imbalance problem; instance selection; instance-based learning; very large problems
  • fLanguage
    English
  • Journal_Title
    Cybernetics, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    2168-2267
  • Type
    jour
  • DOI
    10.1109/TSMCB.2012.2206381
  • Filename
    6253271
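
The abstract describes the approach only at a high level: divide the data, apply a selection process to balanced subsets, and combine the results. The following is a minimal Python sketch of that general idea, not the algorithm published in the paper. The function names, the `chunk` parameter, the way balanced subsets are formed (a majority-class chunk paired with an equally sized random sample of the minority class), and the nearest-neighbour selector are all assumptions made for illustration. Because each subset has bounded size, the total cost grows linearly with the number of majority instances, and the subsets could be processed independently.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def keep_borderline(major, minor):
    """Stand-in selector (an assumption, not the paper's method): keep a
    majority instance if its nearest neighbour within the subset is a
    minority instance, i.e. it lies near the class boundary."""
    kept = []
    for i, p in enumerate(major):
        d_minor = min(sq_dist(p, q) for q in minor)
        d_major = min(
            (sq_dist(p, q) for j, q in enumerate(major) if j != i),
            default=float("inf"),
        )
        if d_minor <= d_major:
            kept.append(p)
    return kept


def divide_and_select(majority, minority, chunk=50, seed=0):
    """Split the majority class into disjoint chunks, pair each chunk with
    a same-sized minority sample, and run the selector on every subset."""
    if not minority:
        raise ValueError("minority class must not be empty")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    selected = []
    for start in range(0, len(shuffled), chunk):
        part = shuffled[start:start + chunk]
        minor = rng.sample(minority, min(len(part), len(minority)))
        selected.extend(keep_borderline(part, minor))
    # All minority instances are retained in this sketch.
    return selected + list(minority)
```

In this sketch each subset costs at most on the order of `chunk` squared distance computations, and there are about `len(majority) / chunk` subsets, so for a fixed `chunk` the running time is linear in the size of the majority class.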