Title :
Cleansing Noisy Data Streams
Author :
Zhu, Xingquan ; Zhang, Peng ; Wu, Xindong ; He, Dan ; Zhang, Chengqi ; Shi, Yong
Author_Institution :
Dept. of Comput. Sci. & Eng., Florida Atlantic Univ., Boca Raton, FL
Abstract :
In this paper, we identify a new research problem: cleansing noisy data streams that contain incorrectly labeled training examples. The objective is to accurately identify and remove mislabeled data, so that prediction models built from the cleansed streams are more accurate than those trained on the raw noisy streams. For this purpose, we first use bias-variance decomposition to derive a maximum variance margin (MVM) principle for stream data cleansing. Following this principle, we further propose a local and global filtering (LgF) framework that combines the strengths of local noise filtering (within a single data chunk) and global noise filtering (across a number of adjacent data chunks) to identify erroneous data. Experimental results on six data streams (including two real-world data streams) demonstrate that LgF significantly outperforms simple methods in identifying noisy examples.
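The abstract does not give the details of LgF, so the following is only a rough sketch of the general idea of combining a within-chunk filter with a filter built from adjacent chunks. Everything in it is an assumption made for illustration, not the authors' algorithm: the nearest-centroid classifier, the interleaved 3-fold split, the strict majority vote, the rule that an example is dropped only when both filters flag it, and all function names (`train_centroids`, `local_suspects`, `global_suspects`, `cleanse`). The MVM principle is not modeled here.

```python
from collections import defaultdict


def train_centroids(examples):
    """Toy nearest-centroid model: mean feature vector per label (assumed classifier)."""
    sums, counts = {}, defaultdict(int)
    for x, y in examples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for i, v in enumerate(x):
            sums[y][i] += v
        counts[y] += 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))


def local_suspects(chunk, folds=3):
    """Indices misclassified by a model trained on the other folds of the SAME chunk."""
    suspects = set()
    for f in range(folds):
        train = [ex for i, ex in enumerate(chunk) if i % folds != f]
        if not train:
            continue
        model = train_centroids(train)
        for i, (x, y) in enumerate(chunk):
            if i % folds == f and predict(model, x) != y:
                suspects.add(i)
    return suspects


def global_suspects(chunk, neighbours):
    """Indices misclassified by a strict majority of models trained on ADJACENT chunks."""
    models = [train_centroids(c) for c in neighbours if c]
    suspects = set()
    for i, (x, y) in enumerate(chunk):
        wrong = sum(predict(m, x) != y for m in models)
        if models and wrong * 2 > len(models):
            suspects.add(i)
    return suspects


def cleanse(chunks, t, window=1):
    """Drop examples of chunk t flagged by BOTH filters (assumed combination rule)."""
    chunk = chunks[t]
    lo, hi = max(0, t - window), min(len(chunks), t + window + 1)
    neighbours = [chunks[j] for j in range(lo, hi) if j != t]
    noisy = local_suspects(chunk) & global_suspects(chunk, neighbours)
    cleaned = [ex for i, ex in enumerate(chunk) if i not in noisy]
    return cleaned, sorted(noisy)


# Made-up 1-D stream: class 0 lies near 0, class 1 near 10.
clean_chunk = [((0.0,), 0), ((1.0,), 0), ((0.5,), 0),
               ((10.0,), 1), ((9.0,), 1), ((9.5,), 1)]
noisy_chunk = [((0.2,), 0), ((9.8,), 1), ((0.8,), 0), ((9.1,), 1),
               ((0.4,), 1),  # deliberately mislabeled example
               ((9.4,), 1), ((0.6,), 0), ((9.9,), 1), ((0.1,), 0)]
chunks = [clean_chunk, noisy_chunk, clean_chunk]
cleaned, noisy = cleanse(chunks, t=1)  # noisy == [4]
```

Requiring agreement between the two filters is a deliberately conservative choice in this sketch: it trades some missed noise for fewer correctly labeled examples being discarded. A union of the two suspect sets would be the more aggressive alternative.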
Keywords :
data mining; filtering theory; noise; bias-variance decomposition; data cleansing; global filtering; global noise filtering; incorrectly labeled training; local noise filtering; maximum variance margin principle; mislabeled data; noisy data streams; Computer science; Data engineering; Filtering; Information technology; Predictive models; Supervised learning; USA Councils; Voting; Working environment noise; classification; data streams
Conference_Title :
Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
Conference_Location :
Pisa
Print_ISBN :
978-0-7695-3502-9
DOI :
10.1109/ICDM.2008.45