DocumentCode
3134002
Title
Parallelizing clustering of geoscientific data sets using data streams
Author
Nittel, Silvia ; Leung, Kelvin T.
Author_Institution
Spatial Inf. Sci. & Eng., Maine Univ., Orono, ME, USA
fYear
2004
fDate
21-23 June 2004
Firstpage
73
Lastpage
84
Abstract
Computing data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can be fit into RAM and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality with increasing data density per grid cell.
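The two-operator idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`weighted_kmeans`, `chunks`, `partial_merge_kmeans`), the evenly spaced centroid initialization, the fixed iteration count, and the `chunk_size` parameter standing in for "as much data as fits into RAM" are all assumptions made for illustration. The sketch runs sequentially; in the parallel setting the abstract describes, each chunk could be handled by a cloned partial operator.

```python
def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd-style k-means on weighted points.

    Returns (centroids, cluster_weights). Initial centroids are picked at
    evenly spaced positions in the input, a simplification assumed here.
    """
    n, dim = len(points), len(points[0])
    k = min(k, n)
    centroids = [list(points[i * n // k]) for i in range(k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            if totals[j] > 0:  # an empty cluster keeps its old centroid
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids, totals


def chunks(stream, chunk_size):
    """Yield consecutive lists of at most chunk_size points from a stream."""
    buf = []
    for p in stream:
        buf.append(p)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:
        yield buf


def partial_merge_kmeans(stream, k, chunk_size):
    """Partial step per chunk, then a merge step over weighted centroids."""
    summary_pts, summary_wts = [], []
    for chunk in chunks(stream, chunk_size):
        # Partial operator: every raw point enters with weight 1.
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        for c, w in zip(cents, wts):
            if w > 0:  # drop empty clusters from the summary
                summary_pts.append(c)
                summary_wts.append(w)
    # Merge operator: cluster the weighted partial centroids.
    return weighted_kmeans(summary_pts, summary_wts, k)
```

Because each partial centroid carries the number of points it represents, the merge step works on a small weighted summary rather than the raw data, and the total weight of the final clusters equals the number of points consumed from the stream.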
Keywords
data mining; geophysics computing; pattern clustering; scientific information systems; RAM; clustering parallelization; computation time; data density; data mining algorithms; data set clustering; data stream paradigm; data subset; geoscientific data sets; grid cell; massive geospatial data sets; one-step algorithm; partial data stream operator; partial merge k-means algorithm; performance evaluation; processing power; volatile memory; weighted k-means; Algorithm design and analysis; Clustering algorithms; Data mining; Earth; Grid computing; Histograms; Image coding; Instruments; Satellites; Scalability;
fLanguage
English
Publisher
IEEE
Conference_Titel
Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
ISSN
1099-3371
Print_ISBN
0-7695-2146-0
Type
conf
DOI
10.1109/SSDM.2004.1311195
Filename
1311195