Selecting samples for labeling in unbalanced streaming data environments

Author

Hanqing Hu ; Kantardzic, Mehmed M. ; Sethi, Tegjyot Singh

Author_Institution

CECS Dept., Univ. of Louisville, Louisville, KY, USA

fYear

2013

fDate

Oct. 30 2013-Nov. 1 2013

Firstpage

1

Lastpage

7

Abstract

In this paper we propose an alternative to random selection for labeling extremely unbalanced stream data sets, in which one class makes up only 1-10% of the entire data set. Labeling, especially when human resources are needed, is often time-consuming and expensive. In an extremely unbalanced data set, many data points usually need to be labeled to obtain enough minority class samples. The goal of this research was to reduce the total number of samples that must be labeled when training new classification models to update a streaming data ensemble classifier. Our proposed approach is to find minority class clusters using the grid density algorithm and to sample minority class instances inside those regions. Results on a synthetic data set showed that the efficiency of the proposed approach varies with grid size. Results on real-world data sets confirmed that observation and showed that, when the data set has high dimensionality, dimensionality reduction is useful for reducing the number of grids in the data space, which increases sampling efficiency. Our best results showed a 19.4% improvement for an eight-dimension data set without dimensionality reduction, and a 27.4% improvement for a thirty-six-dimension data set with dimensionality reduction.
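
The abstract describes the idea only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes a uniform grid with a single cell width, a small set of already-labeled points, and a simple count threshold for calling a cell "dense"; the names `cell_of`, `dense_minority_cells`, `select_for_labeling`, `cell_size`, and `min_count` are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


def dense_minority_cells(labeled, cell_size, min_count=2, minority_label=1):
    """Return the grid cells holding at least `min_count` labeled minority points.

    `labeled` is an iterable of (point, label) pairs. The count threshold is
    an assumed, simplified stand-in for a grid density criterion.
    """
    counts = defaultdict(int)
    for point, label in labeled:
        if label == minority_label:
            counts[cell_of(point, cell_size)] += 1
    return {cell for cell, n in counts.items() if n >= min_count}


def select_for_labeling(unlabeled, dense_cells, cell_size, budget, seed=0):
    """Pick up to `budget` unlabeled points that fall inside dense minority cells."""
    candidates = [p for p in unlabeled if cell_of(p, cell_size) in dense_cells]
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:budget]


# Toy two-dimensional example; label 1 marks the minority class.
labeled = [((0.1, 0.2), 1), ((0.3, 0.4), 1), ((5.0, 5.0), 0), ((9.1, 9.2), 1)]
unlabeled = [(0.5, 0.5), (0.9, 0.1), (5.5, 5.5), (9.5, 9.5)]

cells = dense_minority_cells(labeled, cell_size=1.0)
picked = select_for_labeling(unlabeled, cells, cell_size=1.0, budget=5)
```

In this sketch `cell_size` plays the role of the grid size that the abstract reports as affecting efficiency: a larger cell pulls more majority points into each candidate region, while a smaller cell leaves fewer cells that reach the density threshold.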

Keywords

data analysis; pattern classification; pattern clustering; random processes; sampling methods; classification model training; data point labelling; data set dimensionality reduction; data space; eight-dimension data set improvement; extremely-unbalanced stream data set labeling; grid density algorithm; grid sizes; human resources; minority class clusters; minority-class samples; random sample selection; sampling efficiency improvement; streaming data ensemble classifier update; synthetic data set; thirty-six-dimension data set improvement; unbalanced streaming data environment; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Data mining; Data models; Labeling; Training; Classification; Grid Density; Labeling; Stream Data;

fLanguage

English

Publisher

IEEE

Conference_Titel

Information, Communication and Automation Technologies (ICAT), 2013 XXIV International Symposium on

Conference_Location

Sarajevo

Type

conf

DOI

10.1109/ICAT.2013.6684046

Filename

6684046