DocumentCode :
2944479
Title :
Spectral clustering in high-dimensions: Necessary and sufficient conditions for dense and sparse mixtures
Author :
Wainwright, Martin J.
Author_Institution :
Dept. of Stat., UC Berkeley, Berkeley, CA
fYear :
2008
fDate :
23-26 Sept. 2008
Firstpage :
191
Lastpage :
191
Abstract :
Loosely speaking, clustering refers to the problem of grouping data, and it plays a central role in statistical machine learning and data analysis. One way to formalize the clustering problem is in terms of a mixture model, where the mixture components represent clusters within the data. We consider a semi-parametric formulation in which a random vector $X \in \mathbb{R}^d$ is modeled by the $m$-component mixture distribution
$$F_X(x) = \sum_{\alpha=1}^{m} \omega_\alpha F_\alpha(x - \mu_\alpha). \quad (1)$$
Here $\omega_\alpha \in (0, 1)$ is the weight on mixture component $\alpha$. The mean vectors $\mu_\alpha \in \mathbb{R}^d$ are the (parametric) component of interest, whereas the dispersion distributions $F_\alpha$ are a non-parametric nuisance component, on which we impose only tail conditions (e.g., sub-Gaussian or sub-exponential tail decay). Given $n$ independent and identically distributed samples from the mixture model (1), we consider the problem of "learning" the mixture. More formally, for parameters $(\delta, \epsilon) \in (0, 1) \times (0, 1)$, we say that a method $(\epsilon, \delta)$-learns the mixture if it correctly determines the mixture label of all $n$ samples with probability greater than $1 - \delta$, and estimates the mean vectors to accuracy $\epsilon$ with high probability. This conference abstract provides an overview of the results in our full-length paper. We derive both necessary and sufficient conditions on the scaling of the sample size $n$ as a function of the ambient dimension $d$, the minimum separation $r(d) = \min_{\alpha \neq \beta} \|\mu_\alpha - \mu_\beta\|_2$ between mixture components, and the tail decay parameters. All of our analysis is high-dimensional in nature, meaning that we allow the sample size $n$, the ambient dimension $d$, and the other parameters to scale in arbitrary ways. Our necessary conditions are information-theoretic in nature, and provide lower bounds on the performance of any algorithm, regardless of its computational complexity. Our sufficient conditions are based on analyzing a particular form of spectral clustering. For mixture models without any constraints on the mean vectors $\mu_\alpha$, we show that standard spectral clustering, that is, clustering based on sample means and covariance matrices, can achieve the information-theoretic limits. We also analyze mixture models in which the mean vectors are "sparse", and derive information-theoretic lower bounds for this setting. For such models, spectral clustering based on sample means/covariances is highly sub-optimal, but modified spectral clustering algorithms using thresholding estimators are nearly optimal.
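Illustrative Sketch :
The abstract names two procedures: standard spectral clustering based on sample means and covariance matrices, and a modified variant using thresholding estimators for sparse mean vectors. As a rough illustration only (this is not the authors' algorithm or code; the function name `spectral_cluster`, the entrywise hard-threshold rule, the threshold value, and the k-means details are assumptions made for the sketch), the NumPy snippet below projects the centered samples onto the top $m-1$ eigenvectors of the (optionally hard-thresholded) sample covariance and clusters in that low-dimensional subspace:
```python
import numpy as np

def spectral_cluster(X, m, threshold=None, n_restarts=10, seed=0):
    """Cluster n samples in R^d drawn from an m-component location mixture (1).

    Standard variant (threshold=None): project the centered samples onto the
    top (m-1) eigenvectors of the sample covariance, the subspace that, in
    expectation, contains the centered mean vectors mu_alpha, then run a
    simple k-means there.  Sparse variant (illustrative assumption): hard-
    threshold the entries of the sample covariance first, suppressing
    coordinates that carry only noise when the mean vectors are sparse.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)                      # center at the sample mean
    S = (Xc.T @ Xc) / n                          # d x d sample covariance
    if threshold is not None:
        S = np.where(np.abs(S) >= threshold, S, 0.0)  # entrywise hard threshold
    _, evecs = np.linalg.eigh(S)                 # eigenvalues ascending
    V = evecs[:, -(m - 1):]                      # top (m-1) eigenvectors
    Z = Xc @ V                                   # projected samples, n x (m-1)

    # Plain Lloyd's k-means in the projected space, with random restarts.
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        centers = Z[rng.choice(n, size=m, replace=False)]
        for _ in range(100):
            dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array(
                [Z[labels == k].mean(axis=0) if np.any(labels == k)
                 else centers[k] for k in range(m)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels

    # Plug-in estimates of the mean vectors from the recovered labels.
    mu_hat = np.array([X[best_labels == k].mean(axis=0) for k in range(m)])
    return best_labels, mu_hat

if __name__ == "__main__":
    # Two sub-Gaussian components in d = 200 dimensions whose means differ
    # only in their first 5 coordinates (a "sparse" mean-vector instance).
    rng = np.random.default_rng(1)
    d, n_per = 200, 500
    mu0, mu1 = np.zeros(d), np.zeros(d)
    mu0[:5], mu1[:5] = 3.0, -3.0
    X = np.vstack([mu0 + rng.standard_normal((n_per, d)),
                   mu1 + rng.standard_normal((n_per, d))])
    labels, mu_hat = spectral_cluster(X, m=2, threshold=0.5)
    print("cluster sizes:", np.bincount(labels))
```
The hard threshold in the sparse variant zeroes the many small covariance entries contributed purely by noise coordinates, which conveys the intuition for why thresholded spectral clustering can improve on the plain sample-covariance version in the sparse regime the abstract describes.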
Keywords :
covariance matrices; pattern clustering; sparse matrices; statistical analysis; covariance matrices; data analysis; data grouping; dispersion distributions; random vectors; sparse mixtures; spectral clustering; statistical machine learning; tail decay parameters; thresholding estimators; Clustering algorithms; Computational complexity; Covariance matrix; Data analysis; Information analysis; Machine learning; Probability distribution; Statistical analysis; Sufficient conditions; Tail;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2008 46th Annual Allerton Conference on Communication, Control, and Computing
Conference_Location :
Urbana-Champaign, IL
Print_ISBN :
978-1-4244-2925-7
Electronic_ISBN :
978-1-4244-2926-4
Type :
conf
DOI :
10.1109/ALLERTON.2008.4797554
Filename :
4797554