• DocumentCode
    2849883
  • Title
    Fast and exact out-of-core k-means clustering
  • Author
    Goswami, Anjan ; Jin, Ruoming ; Agrawal, Gagan
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2004
  • fDate
    1-4 Nov. 2004
  • Firstpage
    83
  • Lastpage
    90
  • Abstract
    Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means.
  • Keywords
    data mining; pattern clustering; cluster centers; k-means algorithm; k-means clustering; large disk-resident datasets; real datasets; synthetic datasets; Algorithm design and analysis; Clustering algorithms; Computer science; Convergence; Data engineering; Data mining; Databases; Pattern recognition; Sampling methods; Statistics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
  • Print_ISBN
    0-7695-2142-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2004.10102
  • Filename
    1410270
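
The abstract describes a two-stage idea: build initial cluster centers from a sample, then adjust them with one or more passes over the full disk-resident dataset. The sketch below illustrates only that general shape and is not the authors' algorithm. In particular, it does not implement the analysis that guarantees the same centers as the original k-means. It runs plain k-means on a reservoir sample, then repeats chunked full passes until the centers stop moving. The names `sample_then_refine`, `kmeans_passes`, `update_centers`, and the `chunks` callable (a function that returns a fresh iterator over data chunks, standing in for re-reading a file) are all assumptions made for illustration. Drawing the sample here costs one extra pass over the data.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def update_centers(chunks, centers):
    """One pass over the data, chunk by chunk; returns the recomputed centers."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for chunk in chunks():  # only one chunk is held in memory at a time
        for point in chunk:
            j = nearest(point, centers)
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += point[d]
    # An empty cluster keeps its previous center.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]


def kmeans_passes(chunks, centers, max_passes=100, tol=1e-9):
    """Repeat full passes until no center moves more than tol (squared distance)."""
    for n in range(1, max_passes + 1):
        new = update_centers(chunks, centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, cur))
            for old, cur in zip(centers, new)
        )
        centers = new
        if shift <= tol:
            return centers, n
    return centers, max_passes


def sample_then_refine(chunks, k, sample_size, seed=0):
    """Hypothetical helper: k-means on a sample, then chunked full passes."""
    rng = random.Random(seed)
    sample, seen = [], 0
    for chunk in chunks():  # reservoir sampling: one pass, bounded memory
        for point in chunk:
            seen += 1
            if len(sample) < sample_size:
                sample.append(point)
            else:
                i = rng.randrange(seen)
                if i < sample_size:
                    sample[i] = point
    init = [list(p) for p in rng.sample(sample, k)]
    # Stage 1: in-memory k-means on the sample only.
    sample_centers, _ = kmeans_passes(lambda: [sample], init)
    # Stage 2: full passes over the whole dataset, starting from those centers.
    return kmeans_passes(chunks, sample_centers)
```

If the sample is representative, the second stage starts close to a fixed point and usually needs few full passes. Plain k-means pays for every iteration with a pass over the entire dataset.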