• DocumentCode
    3134002
  • Title

    Parallelizing clustering of geoscientific data sets using data streams

  • Author

    Nittel, Silvia ; Leung, Kelvin T.

  • Author_Institution
    Spatial Inf. Sci. & Eng., Maine Univ., Orono, ME, USA
  • fYear
    2004
  • fDate
    21-23 June 2004
  • Firstpage
    73
  • Lastpage
    84
  • Abstract
    Running data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators that adapt to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as fits into RAM and performs a weighted k-means on that data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that partial/merge k-means can outperform a one-step algorithm by a large margin in overall computation time and clustering quality as data density per grid cell increases.
  • Keywords
    data mining; geophysics computing; pattern clustering; scientific information systems; RAM; clustering parallelization; computation time; data density; data mining algorithms; data set clustering; data stream paradigm; data subset; geoscientific data sets; grid cell; massive geospatial data sets; one-step algorithm; partial data stream operator; partial merge k-means algorithm; performance evaluation; processing power; volatile memory; weighted k-means; Algorithm design and analysis; Clustering algorithms; Data mining; Earth; Grid computing; Histograms; Image coding; Instruments; Satellites; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
  • ISSN
    1099-3371
  • Print_ISBN
    0-7695-2146-0
  • Type

    conf

  • DOI
    10.1109/SSDM.2004.1311195
  • Filename
    1311195
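
The partial/merge scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the authors' operator interfaces, initialization, or stopping criteria, so everything below is an assumption made for illustration: the function names (`weighted_kmeans`, `chunks`, `partial_merge_kmeans`), the fixed iteration count, the random seeding, and the use of a fixed `chunk_size` as a stand-in for "as much data as fits into RAM". Each partial step clusters one chunk with unit weights; the merge step then runs a weighted k-means over the partial centroids, using each centroid's accumulated weight.

```python
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style weighted k-means (illustrative, not the paper's code).

    Returns (centroids, cluster_weights), where cluster_weights[j] is the
    total weight assigned to centroid j in the final assignment pass.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: initialize from k distinct input positions.
    centroids = [list(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each non-empty centroid to the weighted mean of its points.
        for j in range(k):
            if totals[j] > 0:
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids, totals


def chunks(stream, chunk_size):
    """Group a stream of points into lists of at most chunk_size points."""
    buf = []
    for p in stream:
        buf.append(p)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:
        yield buf


def partial_merge_kmeans(stream, k, chunk_size):
    """Partial step per chunk, then one weighted merge over all partial results."""
    partial_centroids, partial_weights = [], []
    for chunk in chunks(stream, chunk_size):
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), min(k, len(chunk)))
        for c, w in zip(cents, wts):
            if w > 0:  # drop empty clusters before merging
                partial_centroids.append(tuple(c))
                partial_weights.append(w)
    # Merge step: cluster the weighted partial centroids.
    return weighted_kmeans(
        partial_centroids, partial_weights, min(k, len(partial_centroids))
    )
```

In this sketch the partial steps are independent of each other, so the per-chunk calls could be distributed across workers before the merge, which is one way to read the abstract's statement that operators can be cloned and parallelized.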