Combining nearest neighbour classifiers based on small subsamples for big data analytics

Author

Krawczyk, Bartosz ; Wozniak, Michal

Author_Institution

Dept. of Syst. & Comput. Networks, Wroclaw Univ. of Technol., Wrocław, Poland

fYear

2015

fDate

24-26 June 2015

Firstpage

311

Lastpage

316

Abstract

Contemporary machine learning systems must be able to deal with ever-growing volumes of data. However, most canonical classifiers are not well suited to big data analytics. This is especially evident for distance-based classifiers, whose classification time is prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple yet efficient technique based on random under-sampling of the dataset. As we deal with stationary data, one may assume that a subset of objects will sufficiently capture the properties of the given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. With this approach, one does not need to aggregate datasets, only the local decisions of the classifiers. Our experimental results show that such an approach can return results comparable to those of a nearest neighbour classifier built over the entire dataset, but with a significantly reduced classification time. We investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows for a further reduction of the computational complexity of the nearest neighbour rule for big data.
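The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the neighbourhood size, the subsample size, or the combination rule, so the 1-NN base classifier, the majority vote, the parameter values, and all function names below (`nn_predict`, `build_ensemble`, `ensemble_predict`, `make_toy_data`) are assumptions made purely for illustration, and the data are synthetic.

```python
import random
from collections import Counter


def nn_predict(sample, query):
    """1-NN (assumed base classifier): label of the closest point in `sample`."""
    def sq_dist(point):
        # Squared Euclidean distance is enough for ranking neighbours.
        return sum((a - b) ** 2 for a, b in zip(point, query))

    _, label = min(sample, key=lambda pair: sq_dist(pair[0]))
    return label


def build_ensemble(data, n_members, sample_size, seed=0):
    """Draw `n_members` small random subsamples; each one acts as a 1-NN classifier."""
    rng = random.Random(seed)
    return [rng.sample(data, sample_size) for _ in range(n_members)]


def ensemble_predict(ensemble, query):
    """Combine only the members' local decisions (majority vote is an assumption)."""
    votes = Counter(nn_predict(sample, query) for sample in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_data(n_per_class, seed=1):
    """Synthetic two-class data: Gaussian blobs centred at (0, 0) and (5, 5)."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (5.0, 5.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((point, label))
    return data


data = make_toy_data(n_per_class=500)
# Each member stores 20 of the 1000 objects, so a query is compared
# against 7 * 20 = 140 points instead of all 1000.
ensemble = build_ensemble(data, n_members=7, sample_size=20)
pred_a = ensemble_predict(ensemble, (0.0, 0.0))
pred_b = ensemble_predict(ensemble, (5.0, 5.0))
```

Because each member depends only on its own subsample and emits a single label, the members could in principle be evaluated on separate machines and only their votes collected, which is the property the abstract relies on for the distributed setting.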

Keywords

Big Data; computational complexity; data analysis; learning (artificial intelligence); pattern classification; sampling methods; aggregate dataset; big data analytics; classification time; classifier decision; computational complexity; data volume; dataset random undersampling; distance-based classifier; distributed environment; ensemble member; machine learning system; nearest neighbour classifier; stationary data; Accuracy; Aggregates; Big data; Computer architecture; Couplings; Prototypes; Training; big data; classifier ensemble; distributed classifier; machine learning; parallel computing;

fLanguage

English

Publisher

IEEE

Conference_Titel

Cybernetics (CYBCONF), 2015 IEEE 2nd International Conference on

Conference_Location

Gdynia

Print_ISBN

978-1-4799-8320-9

Type

conf

DOI

10.1109/CYBConf.2015.7175952

Filename

7175952