DocumentCode :
1202676
Title :
Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging
Author :
Lin, Cheng-Ru ; Chen, Ming-Syan
Author_Institution :
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
Volume :
17
Issue :
2
fYear :
2005
Firstpage :
145
Lastpage :
159
Abstract :
Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids or the distance between two closest (or farthest) data points However, all of these measures are vulnerable to outliers and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, to measure the intercluster distances. By using this new measure of cohesion, we have designed a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear to the size of input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. The time and the space complexities of algorithm CSM are analyzed. As shown by our performance studies, the cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster the data sets of arbitrary shapes very efficiently and provide better clustering results than those by prior methods.
Keywords :
computational complexity; data mining; pattern clustering; very large databases; cohesion-based self-merging; computational complexity; computational statistics; data clustering; data mining; data set; hierarchical algorithm; partitional algorithm; Algorithm design and analysis; Clustering algorithms; Clustering methods; Data mining; Partitioning algorithms; Robustness; Shape; Size measurement; Statistics; Time measurement;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
ieee
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.21
Filename :
1377168
Link To Document :
بازگشت