DocumentCode :
549188
Title :
A local dependence measure and its application to screening for high correlations in large data sets
Author :
Sricharan, Kumar ; Hero, Alfred O., III ; Rajaratnam, Bala
Author_Institution :
Dept. of EECS, Univ. of Michigan, Ann Arbor, MI, USA
fYear :
2011
fDate :
5-8 July 2011
Firstpage :
1
Lastpage :
8
Abstract :
Correlation screening is frequently the only practical way to discover dependencies in very high-dimensional data. In correlation screening, a high threshold is applied to the matrix of sample correlation coefficients of the multivariate data. Variables whose coefficients exceed the threshold are called discoveries and are classified as dependent. The mean number of discoveries and the number of false discoveries in correlation screening problems depend on an information-theoretic measure J, a novel type of information divergence that is a function of the joint density of pairs of variables. It is therefore important to estimate J in order to determine screening thresholds for desired false alarm rates. In this paper, we propose a kernel estimator for J, establish its asymptotic consistency, and determine its asymptotic distribution. These results are used to minimize the MSE of the estimator and to determine confidence intervals on J. We use these results to test for dependence between variables in simulated data sets and between email spam harvesters. Finally, we use the estimate of J to determine screening thresholds in correlation screening problems involving gene expression data.
Keywords :
correlation methods; data analysis; estimation theory; genetic algorithms; information theory; mean square error methods; MSE; asymptotic consistency; asymptotic distribution; confidence intervals; correlation screening problems; desired false alarm rates; dimensional data; email spam harvesters; false discovery; gene expression data; information divergence; information-theoretic measure; joint density; kernel estimator; large data sets; local dependence measure; multivariate data; sample correlation coefficients; screening thresholds; simulated data sets; Correlation; Covariance matrix; Electronic mail; Estimation; Gaussian distribution; Joints; Random variables; CLT; Dependence measure; Information theory; correlation screening; estimation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on
Conference_Location :
Chicago, IL
Print_ISBN :
978-1-4577-0267-9
Type :
conf
Filename :
5977629