مرکز منطقه ای اطلاع رساني علوم و فناوري - Transform Residual K-Means Trees for Scalable Clustering

DocumentCode :

3438046

Title :

Transform Residual K-Means Trees for Scalable Clustering

Author :

Jiangbo Yuan ; Xiuwen Liu

Author_Institution :

Dept. of Comput. Sci., Florida State Univ., Tallahassee, FL, USA

fYear :

2013

fDate :

7-10 Dec. 2013

Firstpage :

489

Lastpage :

496

Abstract :

The K-means problem, i.e., to partition a dataset into K clusters is a fundamental problem common to numerous data mining applications. As it is an NP-hard problem, iterative optimizations are typically used such as the K-means algorithm to compute cluster centers. As the K-means algorithm is exponential both in computation and storage in terms of required bits to encode cluster centers, approaches with low computation and storage complexity have been actively studied both for signal compression under real-time constraints and for clustering of large scale high-dimensional datasets. By generating cluster centers via Cartesian products of cluster centers in multiple groups or multiple stages, product and residual K-means trees reduce computation and storage complexity, making it possible to cluster large scale datasets. As residual K-means trees do not require assumed statistical independence among groups that are required by product K-means trees, they generally give better clustering quality. A known limitation of residual K-means trees is that the gain diminishes with added stages. In this paper, by identifying increased intrinsic dimensions of residual vectors with more stages as a limiting factor of performance due to the curse of dimensionality, we propose transform residual K-means trees as a generalization by applying cluster specific transforms to increase the correlations of residual vectors. Our methods substantially reduce the increase of intrinsic dimensions and therefore increase the effectiveness of multiple-stage residual Kmeans trees. Experimental and comparative results using widely used datasets show our methods give the best performance among all scalable methods. Furthermore, our methods enable effective super-clustering learning and provide scalable solutions to duplicate detection and other data mining problems.

Keywords :

computational complexity; data mining; generalisation (artificial intelligence); learning (artificial intelligence); pattern clustering; transforms; trees (mathematics); vectors; Cartesian products; K-means algorithm; K-means problem; NP-hard problem; cluster centers; cluster specific transforms; clustering quality; computation complexity; curse-of-dimensionality; data mining applications; dataset partitioning; duplicate detection; generalization; iterative optimizations; large scale high-dimensional datasets; product K-means trees; realtime constraints; residual K-means trees; residual vectors; scalable clustering; signal compression; statistical independence; storage complexity; super-clustering learning; Clustering algorithms; Complexity theory; Partitioning algorithms; Principal component analysis; Training; Transforms; Vectors; PCA alignment; Product quantization; k-means; resiudal vector quantization; scalable clustering;

fLanguage :

English

Publisher :

ieee

Conference_Titel :

Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on

Conference_Location :

Dallas, TX

Print_ISBN :

978-1-4799-3143-9

Type :

conf

DOI :

10.1109/ICDMW.2013.110

Filename :

6753961

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3438046