• DocumentCode
    3704182
  • Title

    Analysis of Data Preprocessing Increasing the Oversampling Ratio for Extremely Imbalanced Big Data Classification

  • Author

    Río;José Manuel Benítez;Francisco Herrera

  • Author_Institution
    Dept. of Comput. Sci. &
  • Volume
    2
  • fYear
    2015
  • Firstpage
    180
  • Lastpage
    185
  • Abstract
    The term "big data" has caught the attention of experts in the context of learning from data. It describes the exponential growth and availability of data, both structured and unstructured. The design of effective models that can process and extract useful knowledge from these data represents an immense challenge. Focusing on classification problems, many real-world applications present a class distribution in which one or more classes are represented by a large number of examples while the other classes, which are precisely those of primary interest, have a negligible number of examples. This circumstance is known as the problem of classification with imbalanced datasets. In this work, we analyze a hypothesis for increasing the accuracy on the underrepresented class when dealing with extremely imbalanced big data problems under the MapReduce framework. The performance of our solution has been analyzed in an experimental study carried out on the extremely imbalanced big data problem used in the ECBDL'14 Big Data Competition. The results obtained show that it is necessary to find a balance between the classes in order to obtain the highest precision.
  • Keywords
    "Big data","Data models","Bioinformatics","Programming","Feature extraction","Information and communication technology"
  • Publisher
    IEEE
  • Conference_Titel
    Trustcom/BigDataSE/ISPA, 2015 IEEE
  • Type

    conf

  • DOI
    10.1109/Trustcom.2015.579
  • Filename
    7345492