Title :
Selecting samples for labeling in unbalanced streaming data environments
Author :
Hu, Hanqing ; Kantardzic, Mehmed M. ; Sethi, Tegjyot Singh
Author_Institution :
CECS Dept., Univ. of Louisville, Louisville, KY, USA
fDate :
Oct. 30 2013-Nov. 1 2013
Abstract :
In this paper we propose an alternative to random selection for labeling extremely unbalanced stream data sets in which one class makes up only 1-10% of the entire data set. Labeling, especially when human resources are needed, is often time-consuming and expensive. In an extremely unbalanced data set, many data points usually need to be labeled to obtain enough minority class samples. The goal of this research was to reduce the total number of samples that must be labeled when training new classification models to update a streaming data ensemble classifier. Our proposed approach finds minority class clusters using the grid density algorithm and samples minority class instances inside those regions. Results on a synthetic data set showed that the efficiency of the proposed approach varies with grid size. Results on real-world data sets confirmed that observation and showed that, when the data set has high dimensionality, dimensionality reduction is useful for reducing the number of grids in the data space, which increases sampling efficiency. Our best results showed a 19.4% improvement for an eight-dimensional data set without dimensionality reduction and a 27.4% improvement for a thirty-six-dimensional data set with dimensionality reduction.
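The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea (partition the data space into grid cells, mark cells where labeled minority samples have been seen, and prefer unlabeled points from those cells), not the authors' algorithm. The function names, the `cell_size`, `min_count`, and `budget` parameters, and the convention that label 1 marks the minority class are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the index tuple of the grid cell that contains it."""
    return tuple(int(x // cell_size) for x in point)


def minority_cells(labeled, cell_size, min_count=1):
    """Return cells holding at least min_count labeled minority samples.

    Assumption: label 1 marks the minority class.
    """
    counts = defaultdict(int)
    for point, label in labeled:
        if label == 1:
            counts[cell_of(point, cell_size)] += 1
    return {cell for cell, n in counts.items() if n >= min_count}


def select_for_labeling(unlabeled, dense_cells, cell_size, budget, rng):
    """Pick up to budget points, taking those in minority-dense cells first."""
    inside = [p for p in unlabeled if cell_of(p, cell_size) in dense_cells]
    outside = [p for p in unlabeled if cell_of(p, cell_size) not in dense_cells]
    rng.shuffle(inside)
    rng.shuffle(outside)
    return (inside + outside)[:budget]


# Hypothetical toy data: two known minority points share one grid cell.
labeled = [((0.12, 0.15), 1), ((0.18, 0.11), 1), ((0.75, 0.80), 0)]
unlabeled = [(0.11, 0.19), (0.55, 0.42), (0.16, 0.13), (0.91, 0.33)]

dense = minority_cells(labeled, cell_size=0.2)
picked = select_for_labeling(unlabeled, dense, cell_size=0.2, budget=2,
                             rng=random.Random(0))
```

With this toy data, both selected points come from the cell that already contains labeled minority samples, so the labeling budget is spent where minority instances are more likely. The choice of `cell_size` matters in the same way the abstract reports for grid size: cells that are too large dilute the minority region, while cells that are too small leave most cells empty, which is also why fewer dimensions (and hence fewer cells) can help.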
Keywords :
data analysis; pattern classification; pattern clustering; random processes; sampling methods; classification model training; data point labelling; data set dimensionality reduction; data space; eight-dimension data set improvement; extremely-unbalanced stream data set labeling; grid density algorithm; grid sizes; human resources; minority class clusters; minority-class samples; random sample selection; sampling efficiency improvement; streaming data ensemble classifier update; synthetic data set; thirty-six-dimension data set improvement; unbalanced streaming data environment; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Data mining; Data models; Labeling; Training; Classification; Grid Density; Labeling; Stream Data;
Conference_Title :
Information, Communication and Automation Technologies (ICAT), 2013 XXIV International Symposium on
Conference_Location :
Sarajevo
DOI :
10.1109/ICAT.2013.6684046