DocumentCode :
3438046
Title :
Transform Residual K-Means Trees for Scalable Clustering
Author :
Jiangbo Yuan ; Xiuwen Liu
Author_Institution :
Dept. of Comput. Sci., Florida State Univ., Tallahassee, FL, USA
fYear :
2013
fDate :
7-10 Dec. 2013
Firstpage :
489
Lastpage :
496
Abstract :
The K-means problem, i.e., to partition a dataset into K clusters is a fundamental problem common to numerous data mining applications. As it is an NP-hard problem, iterative optimizations are typically used such as the K-means algorithm to compute cluster centers. As the K-means algorithm is exponential both in computation and storage in terms of required bits to encode cluster centers, approaches with low computation and storage complexity have been actively studied both for signal compression under real-time constraints and for clustering of large scale high-dimensional datasets. By generating cluster centers via Cartesian products of cluster centers in multiple groups or multiple stages, product and residual K-means trees reduce computation and storage complexity, making it possible to cluster large scale datasets. As residual K-means trees do not require assumed statistical independence among groups that are required by product K-means trees, they generally give better clustering quality. A known limitation of residual K-means trees is that the gain diminishes with added stages. In this paper, by identifying increased intrinsic dimensions of residual vectors with more stages as a limiting factor of performance due to the curse of dimensionality, we propose transform residual K-means trees as a generalization by applying cluster specific transforms to increase the correlations of residual vectors. Our methods substantially reduce the increase of intrinsic dimensions and therefore increase the effectiveness of multiple-stage residual Kmeans trees. Experimental and comparative results using widely used datasets show our methods give the best performance among all scalable methods. Furthermore, our methods enable effective super-clustering learning and provide scalable solutions to duplicate detection and other data mining problems.
Keywords :
computational complexity; data mining; generalisation (artificial intelligence); learning (artificial intelligence); pattern clustering; transforms; trees (mathematics); vectors; Cartesian products; K-means algorithm; K-means problem; NP-hard problem; cluster centers; cluster specific transforms; clustering quality; computation complexity; curse-of-dimensionality; data mining applications; dataset partitioning; duplicate detection; generalization; iterative optimizations; large scale high-dimensional datasets; product K-means trees; realtime constraints; residual K-means trees; residual vectors; scalable clustering; signal compression; statistical independence; storage complexity; super-clustering learning; Clustering algorithms; Complexity theory; Partitioning algorithms; Principal component analysis; Training; Transforms; Vectors; PCA alignment; Product quantization; k-means; resiudal vector quantization; scalable clustering;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4799-3143-9
Type :
conf
DOI :
10.1109/ICDMW.2013.110
Filename :
6753961
Link To Document :
بازگشت