Title :
Scalable clustering: a distributed approach
Author :
More, P. ; Hall, Lawrence O.
Author_Institution :
Dept. of Comput. Sci. & Eng., South Florida Univ., Tampa, FL, USA
Abstract :
The ever-increasing size of data sets and the poor scalability of clustering algorithms have drawn attention to distributed clustering for partitioning large data sets. In this paper we propose an algorithm to cluster large-scale data sets without clustering all the data at one time. The data is randomly divided into disjoint subsets of almost equal size. We then cluster each subset using the hard k-means or fuzzy k-means algorithm. The centroids of the subsets form an ensemble. A centroid correspondence algorithm transitively solves the correspondence problem among the ensemble of centroids. The centroids are then combined to form a global set of centroids. Experimental results show that, most of the time, the pattern of clusters generated by our algorithm is similar to the pattern generated by clustering all the data at one time. We also show that the disputed examples, those assigned differently by our algorithm and by clustering all the data at one time, lie on the spatial borders of clusters.
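The pipeline described in the abstract (random split, per-subset clustering, centroid correspondence, combination) can be illustrated with a minimal Python sketch. Everything beyond those four steps is an assumption made for illustration, not the authors' method: the function names `kmeans`, `align`, and `distributed_kmeans` are hypothetical, hard k-means uses a farthest-first initialization, correspondence is solved by brute-force matching of each subset's centroids to the previously aligned subset (a simple chained stand-in for the paper's transitive correspondence algorithm), and the global centroids are unweighted means.

```python
import itertools
import math
import random


def kmeans(points, k, iters=50):
    """Hard k-means (Lloyd's algorithm) with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def align(reference, centroids):
    """Reorder centroids to minimize total distance to reference (brute force, small k only)."""
    k = len(reference)
    best = min(
        itertools.permutations(range(k)),
        key=lambda perm: sum(
            math.dist(reference[i], centroids[perm[i]]) for i in range(k)
        ),
    )
    return [centroids[j] for j in best]


def distributed_kmeans(points, k, n_subsets, seed=0):
    """Cluster disjoint random subsets, align their centroids, and average them."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    # Striding yields disjoint subsets whose sizes differ by at most one.
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    ensemble = [kmeans(s, k) for s in subsets]
    # Chain the matching: each subset is aligned to the previously aligned one.
    aligned = [ensemble[0]]
    for cents in ensemble[1:]:
        aligned.append(align(aligned[-1], cents))
    # Combine corresponding centroids with an unweighted mean (an assumption).
    return [
        tuple(sum(x) / n_subsets for x in zip(*(a[i] for a in aligned)))
        for i in range(k)
    ]
```

Called as `distributed_kmeans(data, k=2, n_subsets=3)` on a list of coordinate tuples, the sketch returns `k` global centroids. The brute-force matching grows factorially in `k`, so a larger `k` would need a proper assignment solver.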
Keywords :
fuzzy set theory; pattern clustering; very large databases; fuzzy k-means algorithm; large-scale data sets; scalable clustering; Clustering algorithms; Computer science; Data engineering; Euclidean distance; Fuzzy logic; Fuzzy sets; Iterative algorithms; Partitioning algorithms; Scalability; Testing;
Conference_Title :
Fuzzy Systems, 2004. Proceedings. 2004 IEEE International Conference on
Print_ISBN :
0-7803-8353-2
DOI :
10.1109/FUZZY.2004.1375705