DocumentCode
3269405
Title
Scaling clustering algorithms for massive data sets using data streams
Author
Nittel, Silvia ; Leung, Kelvin T. ; Braverman, Amy
Author_Institution
SIE, Maine Univ., Orono, ME, USA
fYear
2004
fDate
30 March-2 April 2004
Firstpage
830
Abstract
Applying clustering techniques to massive data sets is still neither feasible nor efficient today. For many scientific applications, raw satellite imagery data can be replaced with compressed counterparts. However, to facilitate scientific data analysis, the high-order correlations between the attributes in the data set, as well as their nonparametric distribution, must be preserved in the reduced data set. Practical data reduction can therefore be achieved by partitioning the overall data set via a coarse regular spatial grid and compressing each grid cell individually by computing multivariate histograms or k-means clusterings. Clustering spatial data in high-dimensional spaces using k-means is expensive with regard to both computational cost and memory requirements. In a traditional k-means implementation, all N data points belonging to a grid cell must be kept in memory at the same time to be clustered, which often creates a bottleneck for scientific data sets. Our objective is to define a clustering algorithm that scales automatically to any number of data points in a single grid cell and provides high-quality clustering results.
Keywords
data analysis; data reduction; statistical analysis; visual databases; data stream; k-means clustering; massive data sets; multivariate histogram; satellite imagery data; scientific data analysis; spatial data clustering; spatial data grid; Clustering algorithms; Computer science; Data analysis; Geoscience; Grid computing; Image coding; Kelvin; Laboratories; Propulsion; Satellites
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering, 2004. Proceedings. 20th International Conference on
ISSN
1063-6382
Print_ISBN
0-7695-2065-0
Type
conf
DOI
10.1109/ICDE.2004.1320061
Filename
1320061