Title :
Scalable Clustering for Large High-Dimensional Data Based on Data Summarization
Author :
Lai, Ying ; Orlandic, Ratko ; Yee, Wai Gen ; Kulkarni, Sachin
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
fDate :
March 1 2007-April 5 2007
Abstract :
Clustering large data sets with high dimensionality is a challenging data-mining task. This paper presents a framework to perform such a task efficiently. It is based on the notion of data space reduction, which finds high-density areas, or dense cells, in the given feature space. The dense cells store summarized information about the data. A designated partitioning or hierarchical clustering algorithm can then be used as the second step to find clusters based on the data summaries. Using Kmeans as an example, this paper presents GARDEN-Kmeans, which performs data space reduction using Gamma Region DENsity partition and uses Kmeans to cluster the summarized information. The experimental study shows that GARDEN-Kmeans executes several orders of magnitude faster than basic Kmeans and the recursive bisection Kmeans algorithm of CLUTO, while producing comparable clustering quality.
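The two-step idea in the abstract (summarize the data into dense cells, then cluster the summaries) can be illustrated with a minimal sketch. This is not the GARDEN partition described in the paper: a uniform grid stands in for the Gamma Region DENsity partition, and the names `summarize`, `weighted_kmeans`, `cell_width`, and `min_count` are hypothetical, chosen only for this illustration. Each dense cell is reduced to a centroid and a point count, and a weighted Kmeans then runs on those summaries instead of on the raw points.

```python
import random
from collections import defaultdict


def summarize(points, cell_width, min_count):
    """Bucket points into a uniform grid (an assumed stand-in for GARDEN)
    and keep only dense cells, each as a (centroid, point_count) summary."""
    dim = len(points[0])
    cells = defaultdict(lambda: [0, [0.0] * dim])
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        entry = cells[key]
        entry[0] += 1
        for i, x in enumerate(p):
            entry[1][i] += x
    return [([s / n for s in sums], n)
            for n, sums in cells.values() if n >= min_count]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(summaries, k, iters=20, seed=0):
    """Lloyd-style Kmeans over cell summaries, weighting each cell
    centroid by the number of points it represents."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(summaries, k)]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0] * k
        for centroid, n in summaries:
            j = min(range(k), key=lambda i: sq_dist(centroid, centers[i]))
            weights[j] += n
            for d, x in enumerate(centroid):
                sums[j][d] += n * x
        for j in range(k):
            if weights[j]:  # leave an empty cluster's center unchanged
                centers[j] = [s / weights[j] for s in sums[j]]
    return centers


# Synthetic data (an assumption for the demo): two 2-D Gaussian blobs.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]

summaries = summarize(points, cell_width=1.0, min_count=5)
centers = sorted(weighted_kmeans(summaries, k=2))
print(len(points), "points reduced to", len(summaries), "dense cells")
print("cluster centers:", centers)
```

In this sketch the clustering loop touches only the dense-cell summaries, so its cost per iteration depends on the number of cells rather than the number of points. That is the intuition behind the speedup the abstract reports, although the paper's actual partitioning scheme and measurements are not reproduced here.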
Keywords :
data handling; data mining; pattern clustering; CLUTO; GARDEN-Kmeans; Gamma Region DENsity partition; data space reduction; data summaries; data summarization; hierarchical clustering algorithm; high dimensionality; large data set clustering; large high-dimensional data; partitioning algorithm; recursive bisection Kmeans algorithm; scalable clustering; Algorithm design and analysis; Clustering algorithms; Clustering methods; Computational intelligence; Computer science; Data mining; Iterative algorithms; Partitioning algorithms; Sampling methods; Space technology
Conference_Titel :
Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on
Conference_Location :
Honolulu, HI
Print_ISBN :
1-4244-0705-2
DOI :
10.1109/CIDM.2007.368910