DocumentCode :
2944479
Title :
Spectral clustering in high-dimensions: Necessary and sufficient conditions for dense and sparse mixtures
Author :
Wainwright, Martin J.
Author_Institution :
Dept. of Stat., UC Berkeley, Berkeley, CA
fYear :
2008
fDate :
23-26 Sept. 2008
Firstpage :
191
Lastpage :
191
Abstract :
Loosely speaking, clustering refers to the problem of grouping data, and it plays a central role in statistical machine learning and data analysis. One way to formalize the clustering problem is in terms of a mixture model, where the mixture components represent clusters within the data. We consider a semi-parametric formulation in which a random vector $X \in \mathbb{R}^d$ is modeled by the $m$-component mixture distribution
$$F_X(x) = \sum_{\alpha=1}^{m} \omega_\alpha F_\alpha(x - \mu_\alpha). \quad (1)$$
Here $\omega_\alpha \in (0, 1)$ is the weight on mixture component $\alpha$. The mean vectors $\mu_\alpha \in \mathbb{R}^d$ are the (parametric) component of interest, whereas the dispersion distributions $F_\alpha$ are a non-parametric nuisance component, on which we impose only tail conditions (e.g., sub-Gaussian or sub-exponential tail decay). Given $n$ independent and identically distributed samples from the mixture model (1), we consider the problem of "learning" the mixture. More formally, for parameters $(\delta, \epsilon) \in (0, 1) \times (0, 1)$, we say that a method $(\epsilon, \delta)$-learns the mixture if it correctly determines the mixture label of all $n$ samples with probability greater than $1 - \delta$, and estimates the mean vectors to accuracy $\epsilon$ with high probability. This conference abstract provides an overview of the results in our full-length paper. We derive both necessary and sufficient conditions on the scaling of the sample size $n$ as a function of the ambient dimension $d$, the minimum separation $r(d) = \min_{\alpha \neq \beta} \|\mu_\alpha - \mu_\beta\|_2$ between mixture components, and the tail decay parameters. All of our analysis is high-dimensional in nature, meaning that we allow the sample size $n$, the ambient dimension $d$, and the other parameters to scale in arbitrary ways. Our necessary conditions are information-theoretic in nature, and provide lower bounds on the performance of any algorithm, regardless of its computational complexity. Our sufficient conditions are based on analyzing a particular form of spectral clustering. For mixture models without any constraints on the mean vectors $\mu_\alpha$, we show that standard spectral clustering, that is, clustering based on sample means and covariance matrices, can achieve the information-theoretic limits. We also analyze mixture models in which the mean vectors are "sparse", and derive information-theoretic lower bounds for this setting. For such models, spectral clustering based on sample means/covariances is highly sub-optimal, but modified spectral clustering algorithms using thresholding estimators are nearly optimal.
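Illustrative Sketch :
The abstract names two procedures: standard spectral clustering based on sample means and covariance matrices, and a modified variant using thresholding estimators for sparse mean vectors. As a rough illustration only (this is not the authors' algorithm or code; the function name `spectral_cluster`, the entrywise hard-threshold rule, the threshold value, and the k-means details are assumptions made for the sketch), the NumPy snippet below projects the centered samples onto the top $m-1$ eigenvectors of the (optionally hard-thresholded) sample covariance and clusters in that low-dimensional subspace:
```python
import numpy as np

def spectral_cluster(X, m, threshold=None, n_restarts=10, seed=0):
    """Cluster n samples in R^d drawn from an m-component location mixture (1).

    Standard variant (threshold=None): project the centered samples onto the
    top (m-1) eigenvectors of the sample covariance, the subspace that, in
    expectation, contains the centered mean vectors mu_alpha, then run a
    simple k-means there.  Sparse variant (illustrative assumption): hard-
    threshold the entries of the sample covariance first, suppressing
    coordinates that carry only noise when the mean vectors are sparse.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)                      # center at the sample mean
    S = (Xc.T @ Xc) / n                          # d x d sample covariance
    if threshold is not None:
        S = np.where(np.abs(S) >= threshold, S, 0.0)  # entrywise hard threshold
    _, evecs = np.linalg.eigh(S)                 # eigenvalues ascending
    V = evecs[:, -(m - 1):]                      # top (m-1) eigenvectors
    Z = Xc @ V                                   # projected samples, n x (m-1)

    # Plain Lloyd's k-means in the projected space, with random restarts.
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        centers = Z[rng.choice(n, size=m, replace=False)]
        for _ in range(100):
            dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array(
                [Z[labels == k].mean(axis=0) if np.any(labels == k)
                 else centers[k] for k in range(m)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels

    # Plug-in estimates of the mean vectors from the recovered labels.
    mu_hat = np.array([X[best_labels == k].mean(axis=0) for k in range(m)])
    return best_labels, mu_hat

if __name__ == "__main__":
    # Two sub-Gaussian components in d = 200 dimensions whose means differ
    # only in their first 5 coordinates (a "sparse" mean-vector instance).
    rng = np.random.default_rng(1)
    d, n_per = 200, 500
    mu0, mu1 = np.zeros(d), np.zeros(d)
    mu0[:5], mu1[:5] = 3.0, -3.0
    X = np.vstack([mu0 + rng.standard_normal((n_per, d)),
                   mu1 + rng.standard_normal((n_per, d))])
    labels, mu_hat = spectral_cluster(X, m=2, threshold=0.5)
    print("cluster sizes:", np.bincount(labels))
```
The hard threshold in the sparse variant zeroes the many small covariance entries contributed purely by noise coordinates, which conveys the intuition for why thresholded spectral clustering can improve on the plain sample-covariance version in the sparse regime the abstract describes.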
Keywords :
covariance matrices; pattern clustering; sparse matrices; statistical analysis; covariance matrices; data analysis; data grouping; dispersion distributions; random vectors; sparse mixtures; spectral clustering; statistical machine learning; tail decay parameters; thresholding estimators; Clustering algorithms; Computational complexity; Covariance matrix; Data analysis; Information analysis; Machine learning; Probability distribution; Statistical analysis; Sufficient conditions; Tail;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2008 46th Annual Allerton Conference on Communication, Control, and Computing
Conference_Location :
Urbana-Champaign, IL
Print_ISBN :
978-1-4244-2925-7
Electronic_ISBN :
978-1-4244-2926-4
Type :
conf
DOI :
10.1109/ALLERTON.2008.4797554
Filename :
4797554