Analysis of Data Preprocessing Increasing the Oversampling Ratio for Extremely Imbalanced Big Data Classification

Author

Río;José Manuel Benítez;Francisco Herrera

Author_Institution

Dept. of Comput. Sci. &

Volume

2

fYear

2015

Firstpage

180

Lastpage

185

Abstract

The term "big data" has caught the attention of experts in the context of learning from data. It is used to describe the exponential growth and availability of data, both structured and unstructured. Designing effective models that can process these data and extract useful knowledge from them represents an immense challenge. Focusing on classification problems, many real-world applications present a class distribution in which one or more classes are represented by a large number of examples, while the other classes, which are precisely those of primary interest, have a negligible number of examples. This circumstance is known as the problem of classification with imbalanced datasets. In this work, we analyze a hypothesis for increasing the accuracy of the underrepresented class when dealing with extremely imbalanced big data problems under the MapReduce framework. The performance of our solution has been analyzed in an experimental study carried out on the extremely imbalanced big data problem used in the ECBDL'14 Big Data Competition. The results obtained show that it is necessary to find a balance between the classes in order to obtain the highest precision.

Keywords

"Big data","Data models","Bioinformatics","Programming","Feature extraction","Information and communication technology"

Publisher

IEEE

Conference_Title

Trustcom/BigDataSE/ISPA, 2015 IEEE

Type

conf

DOI

10.1109/Trustcom.2015.579

Filename

7345492