Author/Authors :
Brito, M.R. and Quiroz, A.J. and Yukich, J.E.
Abstract :
Three graph theoretical statistics are considered for the problem of estimating the intrinsic dimension of a data set. The first is the “reach” statistic, $\bar{r}_{j,k}$, proposed in Brito et al. (2002) [4] for the problem of identification of Euclidean dimension. The second, $M_n$, is the sample average of squared degrees in the minimum spanning tree of the data, while the third statistic, $U_n^k$, is based on counting the number of common neighbors among the $k$-nearest, for each pair of sample points $\{X_i, X_j\}$, $i < j \le n$. For the first and third of these statistics, central limit theorems are proved under general assumptions, for data living in an $m$-dimensional $C^1$ submanifold of $\mathbb{R}^d$, and in this setting, we establish the consistency of intrinsic dimension identification procedures based on $\bar{r}_{j,k}$ and $U_n^k$. For $M_n$, asymptotic results are provided whenever data live in an affine subspace of Euclidean space. The graph theoretical methods proposed are compared, via simulations, with a host of recently proposed nearest neighbor alternatives.
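As a rough illustration of the second and third statistics described in the abstract, the following standard-library Python sketch computes the average squared degree in a Euclidean minimum spanning tree and a raw count of shared k-nearest neighbors over all pairs. The function names (`mst_degrees`, `mean_squared_mst_degree`, `common_neighbor_count`) are hypothetical, and the normalization is an assumption: the abstract does not give the exact definitions or scaling of $M_n$ and $U_n^k$, so this should not be read as the authors' estimator.

```python
import math


def mst_degrees(points):
    """Vertex degrees in the Euclidean minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    degree = [0] * n
    if n < 2:
        return degree
    in_tree = [False] * n
    in_tree[0] = True
    # best_dist[i]: distance from point i to its closest point already in the tree
    best_dist = [math.dist(points[0], p) for p in points]
    best_parent = [0] * n
    for _ in range(n - 1):
        # Attach the point outside the tree that is closest to the tree.
        j = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[j] = True
        degree[j] += 1
        degree[best_parent[j]] += 1
        # Update the closest-tree-point records of the remaining points.
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[j], points[i])
                if d < best_dist[i]:
                    best_dist[i] = d
                    best_parent[i] = j
    return degree


def mean_squared_mst_degree(points):
    """Sample average of squared MST degrees (unnormalized analogue of M_n)."""
    deg = mst_degrees(points)
    return sum(d * d for d in deg) / len(deg)


def common_neighbor_count(points, k):
    """Total number of shared k-nearest neighbors over all pairs i < j.

    Assumption: a raw count; any scaling used in the paper is not applied.
    """
    n = len(points)
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        knn.append(set(others[:k]))
    return sum(len(knn[i] & knn[j])
               for i in range(n) for j in range(i + 1, n))
```

For four equally spaced points on a line the tree is a path, so `mst_degrees([(0,), (1,), (2,), (3,)])` gives degrees 1, 2, 2, 1 and the average squared degree is 2.5. Both functions use brute-force distance computations, which is adequate for small samples only.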
Keywords :
Intrinsic dimension, Graph theoretical methods, Dimensionality reduction, Stabilization methods