DocumentCode :
3134002
Title :
Parallelizing clustering of geoscientific data sets using data streams
Author :
Nittel, Silvia ; Leung, Kelvin T.
Author_Institution :
Spatial Inf. Sci. & Eng., Maine Univ., Orono, ME, USA
fYear :
2004
fDate :
21-23 June 2004
Firstpage :
73
Lastpage :
84
Abstract :
Running data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can fit into RAM and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality with increasing data density per grid cell.
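The two-operator scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function names `weighted_kmeans` and `partial_merge_kmeans` are hypothetical, `chunk_size` is an assumed stand-in for "as much data as fits into RAM", and the random initialization and fixed iteration count are assumptions. Operator cloning and parallel execution are not shown.

```python
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style k-means in which every point carries a weight.

    Returns (centroids, cluster_weights). Initialization by random
    sampling and the fixed iteration count are assumptions of this sketch.
    """
    k = min(k, len(points))
    dim = len(points[0])
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            # An empty cluster keeps its previous centroid.
            if totals[j] > 0:
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids, totals


def partial_merge_kmeans(stream, k, chunk_size):
    """Partial step: cluster each chunk; merge step: cluster the weighted results."""
    partial_centroids, partial_weights = [], []
    chunk = []

    def run_partial():
        # Each raw point has weight 1; each output centroid carries the
        # number of points it summarizes.
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        for c, w in zip(cents, wts):
            if w > 0:
                partial_centroids.append(c)
                partial_weights.append(w)

    for point in stream:
        chunk.append(point)
        if len(chunk) == chunk_size:
            run_partial()
            chunk = []
    if chunk:
        run_partial()

    # Merge operator: a weighted k-means over the partial centroids.
    return weighted_kmeans(partial_centroids, partial_weights, k)
```

For example, calling `partial_merge_kmeans(iter(points), k=2, chunk_size=4)` on a sequence of 2-D tuples clusters the data four points at a time and then merges the weighted chunk centroids. The returned weights sum to the number of input points, because each partial centroid carries the count of points it represents into the merge step.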
Keywords :
data mining; geophysics computing; pattern clustering; scientific information systems; RAM; clustering parallelization; computation time; data density; data mining algorithms; data set clustering; data stream paradigm; data subset; geoscientific data sets; grid cell; massive geospatial data sets; one-step algorithm; partial data stream operator; partial merge k-means algorithm; performance evaluation; processing power; volatile memory; weighted k-means; Algorithm design and analysis; Clustering algorithms; Data mining; Earth; Grid computing; Histograms; Image coding; Instruments; Satellites; Scalability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
ISSN :
1099-3371
Print_ISBN :
0-7695-2146-0
Type :
conf
DOI :
10.1109/SSDM.2004.1311195
Filename :
1311195