Title :
Scalable model-based clustering by working on data summaries
Author :
Jin, Huidong ; Wong, Man-Leung ; Leung, Kwong-Sak
Author_Institution :
Dept. of Inf. Syst., Lingnan Univ., Tuen Mun, China
Abstract :
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. We present a two-phase scalable model-based clustering framework: first, a large data set is summarized into subclusters; then, clusters are generated directly from the summary statistics of the subclusters by a specifically designed expectation-maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each subcluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
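The two-phase idea in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' EMADS algorithm, whose details the abstract does not give. It is a simplified stand-in under stated assumptions: the function names `summarize` and `em_on_summaries` are hypothetical, phase one uses plain equal-width binning, and phase two assigns each subcluster softly as a whole, using only its cardinality, mean, and variance. The farthest-point initialization is also an assumption, chosen only to keep the example deterministic.

```python
import math
from collections import defaultdict


def summarize(data, cell_width):
    """Phase 1 (assumed equal-width binning): reduce points to (count, mean, variance) per bin."""
    bins = defaultdict(list)
    for x in data:
        bins[math.floor(x / cell_width)].append(x)
    summaries = []
    for pts in bins.values():
        n = len(pts)
        mean = sum(pts) / n
        var = sum((x - mean) ** 2 for x in pts) / n
        summaries.append((n, mean, var))
    return summaries


def em_on_summaries(summaries, k, n_iter=50, min_var=1e-6):
    """Phase 2 (simplified): fit a 1-D Gaussian mixture from subcluster summaries only."""
    total_n = sum(n for n, _, _ in summaries)
    pooled_mean = sum(n * m for n, m, _ in summaries) / total_n
    pooled_var = sum(n * (v + (m - pooled_mean) ** 2) for n, m, v in summaries) / total_n

    # Assumed initialization: heaviest subcluster first, then farthest subcluster means.
    centers = [max(summaries, key=lambda s: s[0])[1]]
    while len(centers) < k:
        centers.append(max((s[1] for s in summaries),
                           key=lambda m: min(abs(m - c) for c in centers)))
    weights = [1.0 / k] * k
    variances = [max(pooled_var, min_var)] * k

    for _ in range(n_iter):
        mass = [0.0] * k      # expected number of points per component
        first = [0.0] * k     # expected sum of x per component
        second = [0.0] * k    # expected sum of x^2 per component
        for n, m, v in summaries:
            # E-step: average log-density of the subcluster's points under each component.
            logp = [math.log(weights[j])
                    - 0.5 * (math.log(2 * math.pi * variances[j])
                             + (v + (m - centers[j]) ** 2) / variances[j])
                    for j in range(k)]
            top = max(logp)
            resp = [math.exp(lp - top) for lp in logp]
            norm = sum(resp)
            for j in range(k):
                r = n * resp[j] / norm
                mass[j] += r
                first[j] += r * m
                second[j] += r * (v + m * m)
        # M-step: update parameters from the accumulated sufficient statistics.
        for j in range(k):
            mass_j = max(mass[j], 1e-12)
            weights[j] = mass_j / total_n
            centers[j] = first[j] / mass_j
            variances[j] = max(second[j] / mass_j - centers[j] ** 2, min_var)
    return weights, centers, variances
```

In this sketch the second phase never touches the raw points, so its cost per iteration depends on the number of subclusters rather than the number of records. Keeping the within-subcluster variance in the update is what lets the fitted component variances reflect spread that the binning would otherwise hide.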
Keywords :
Gaussian processes; computational complexity; covariance analysis; data mining; pattern clustering; very large databases; EM; Gaussian mixture model; bEMADS; clustering system; covariance information; data mining; data summarization procedures; expectation-maximization algorithm; gEMADS; large databases; two-phase scalable model-based clustering; Algorithm design and analysis; Bridges; Clustering algorithms; Data mining; Databases; Explosives; Information systems; Iterative algorithms; Scalability; Statistics;
Conference_Title :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
DOI :
10.1109/ICDM.2003.1250907