• DocumentCode
    3394414
  • Title
    Combining nearest neighbour classifiers based on small subsamples for big data analytics
  • Author
    Krawczyk, Bartosz ; Wozniak, Michal
  • Author_Institution
    Dept. of Syst. & Comput. Networks, Wroclaw Univ. of Technol., Wrocław, Poland
  • fYear
    2015
  • fDate
    24-26 June 2015
  • Firstpage
    311
  • Lastpage
    316
  • Abstract
    Contemporary machine learning systems must be able to deal with ever-growing volumes of data. However, most canonical classifiers are not well suited to big data analytics. This is especially evident for distance-based classifiers, whose classification time is prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple yet efficient technique based on random under-sampling of the dataset. As we deal with stationary data, one may assume that a subset of objects will sufficiently capture the properties of a given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. With this approach, one does not need to aggregate datasets, only the local decisions of the classifiers. Experimental results show that such an approach can return results comparable to those of a nearest neighbour classifier over the entire dataset, but with a significantly reduced classification time. We investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows a further reduction of the computational complexity of the nearest neighbour rule for big data.
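The approach described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state the combination rule, the subsample size, or the number of ensemble members, so majority voting, the function names (`build_ensemble`, `ensemble_predict`, `make_toy_dataset`), and the toy two-cluster data are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nn_label(subsample, query):
    """1-NN decision: label of the subsample object closest to the query."""
    _, label = min(subsample, key=lambda item: squared_distance(item[0], query))
    return label


def build_ensemble(dataset, n_members, subsample_size, seed=0):
    """Draw one small random subsample (without replacement) per ensemble member.

    Each subsample is itself the 'model' of a nearest neighbour member.
    """
    rng = random.Random(seed)
    return [rng.sample(dataset, subsample_size) for _ in range(n_members)]


def ensemble_predict(ensemble, query):
    """Combine the members' local decisions by majority vote.

    Majority voting is an assumption; the abstract only says that local
    decisions are combined. Ties go to the label that was voted for first.
    """
    votes = Counter(nn_label(subsample, query) for subsample in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_dataset(n_per_class=500, seed=1):
    """Hypothetical data: two well-separated 2-D Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, 0.0), (1, 10.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
            data.append((point, label))
    return data


dataset = make_toy_dataset()
# 15 members, each holding only 20 of the 1000 objects (illustrative values).
ensemble = build_ensemble(dataset, n_members=15, subsample_size=20)
print(ensemble_predict(ensemble, (0.0, 0.0)))
print(ensemble_predict(ensemble, (10.0, 10.0)))
```

In this sketch a query is compared against 15 × 20 = 300 stored objects rather than all 1000, and each member's search is independent of the others, so members could in principle be evaluated on separate machines with only their votes sent back for aggregation.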
  • Keywords
    Big Data; computational complexity; data analysis; learning (artificial intelligence); pattern classification; sampling methods; aggregate dataset; big data analytics; classification time; classifier decision; computational complexity; data volume; dataset random undersampling; distance-based classifier; distributed environment; ensemble member; machine learning system; nearest neighbour classifier; stationary data; Accuracy; Aggregates; Big data; Computer architecture; Couplings; Prototypes; Training; big data; classifier ensemble; distributed classifier; machine learning; parallel computing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cybernetics (CYBCONF), 2015 IEEE 2nd International Conference on
  • Conference_Location
    Gdynia
  • Print_ISBN
    978-1-4799-8320-9
  • Type
    conf
  • DOI
    10.1109/CYBConf.2015.7175952
  • Filename
    7175952