Title :
Space Partitioning for Scalable K-Means
Author :
Pettinger, David ; Di Fatta, Giuseppe
Author_Institution :
Sch. of Syst. Eng., Univ. of Reading, Reading, UK
Abstract :
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force´ since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force´ algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
Keywords :
pattern clustering; tree data structures; tree searching; KD-tree; approximate hierarchical clustering; binary space partitioning k-means; binary space partitioning trees; brute force algorithm; centroid; data partition; data point; data structure; distance computation; geometrical constraint; iterative refinement procedure; multidimensional binary search tree; scalable k-means clustering algorithm; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Data structures; Force; Indexes; Partitioning algorithms; Clustering; K-Means; Space Partitioning Trees;
Conference_Titel :
Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on
Conference_Location :
Washington, DC
Print_ISBN :
978-1-4244-9211-4
DOI :
10.1109/ICMLA.2010.46