Title :
Parallelizing clustering of geoscientific data sets using data streams
Author :
Nittel, Silvia ; Leung, Kelvin T.
Author_Institution :
Spatial Inf. Sci. & Eng., Maine Univ., Orono, ME, USA
Abstract :
Running data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can fit into RAM, and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that, as data density per grid cell increases, the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality.
Keywords :
data mining; geophysics computing; pattern clustering; scientific information systems; RAM; clustering parallelization; computation time; data density; data mining algorithms; data set clustering; data stream paradigm; data subset; geoscientific data sets; grid cell; massive geospatial data sets; one-step algorithm; partial data stream operator; partial merge k-means algorithm; performance evaluation; processing power; volatile memory; weighted k-means; Algorithm design and analysis; Clustering algorithms; Data mining; Earth; Grid computing; Histograms; Image coding; Instruments; Satellites; Scalability;
Conference_Title :
Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004
Print_ISBN :
0-7695-2146-0
DOI :
10.1109/SSDM.2004.1311195
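Illustration :
The partial/merge scheme described in the abstract can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the function names (weighted_kmeans, partial_merge_kmeans), the chunk_size parameter standing in for "as much data as can fit into RAM", the random initialization, and the fixed iteration count are all assumptions made for illustration. Each chunk is clustered on its own (the partial step), and the resulting centroids, weighted by the number of points they represent, are clustered again (the merge step).

```python
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style k-means on weighted points (hypothetical helper).

    Returns (centroids, cluster_weights). Initialization by random
    sampling is an assumption; the paper may use a different scheme.
    """
    if not points:
        return [], []
    k = min(k, len(points))
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    dim = len(points[0])
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Weighted mean per cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, totals


def partial_merge_kmeans(stream, k, chunk_size):
    """Cluster a stream chunk by chunk, then merge the weighted partial results."""
    partial_centroids, partial_weights = [], []
    chunk = []

    def flush():
        # Partial step: every raw point in the chunk has weight 1.
        if chunk:
            cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), k)
            for c, w in zip(cents, wts):
                if w > 0:  # drop empty clusters before merging
                    partial_centroids.append(c)
                    partial_weights.append(w)
            chunk.clear()

    for point in stream:
        chunk.append(tuple(point))
        if len(chunk) == chunk_size:
            flush()
    flush()  # handle the final, possibly shorter chunk

    # Merge step: cluster the weighted partial centroids.
    return weighted_kmeans(partial_centroids, partial_weights, k)
```

Because each chunk is processed independently in this sketch, the partial step could be handed to separate workers, which is the property the abstract refers to when it says the operators can be cloned and parallelized.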