DocumentCode :
2549110
Title :
Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
Author :
Pébay, Philippe ; Thompson, David ; Bennett, Janine
Author_Institution :
Sandia Nat. Labs., Livermore, CA, USA
fYear :
2010
fDate :
20-24 Sept. 2010
Firstpage :
156
Lastpage :
165
Abstract :
Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and χ² independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
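As a rough illustration of the map-reduce pattern the abstract describes, the following minimal Python sketch counts (x, y) pairs per data partition, merges the partial tables, and derives joint probabilities, point-wise mutual information, joint entropy, and the χ² independence statistic from the merged table. It is not the authors' implementation: the function names `local_table`, `merge_tables`, and `derived_statistics` are hypothetical, and the partitions are plain in-memory lists standing in for per-processor data.

```python
from collections import Counter
from functools import reduce
from math import log


def local_table(pairs):
    """Map step (hypothetical name): count (x, y) occurrences in one partition."""
    return Counter(pairs)


def merge_tables(a, b):
    """Reduce step (hypothetical name): add the counts of two partial tables.

    The merged table has one entry per distinct (x, y) pair, so the data
    exchanged here can grow with the data, unlike moment-based statistics.
    """
    return a + b


def derived_statistics(table):
    """Derive statistics from a merged contingency table of raw counts."""
    n = sum(table.values())

    # Marginal counts of x and y.
    px, py = Counter(), Counter()
    for (x, y), count in table.items():
        px[x] += count
        py[y] += count

    # Joint probabilities of the observed pairs.
    joint = {pair: count / n for pair, count in table.items()}

    # Point-wise mutual information: log( p(x,y) / (p(x) p(y)) ).
    pmi = {
        (x, y): log(p * n * n / (px[x] * py[y]))
        for (x, y), p in joint.items()
    }

    # Joint information entropy (natural logarithm).
    entropy = -sum(p * log(p) for p in joint.values())

    # Chi-square independence statistic over all cells, including empty ones.
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            observed = table.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    return joint, pmi, entropy, chi2


# Two partitions standing in for data held by two processors (made-up values).
partitions = [
    [("a", 1), ("a", 1), ("b", 2)],
    [("b", 2), ("a", 2)],
]
merged = reduce(merge_tables, (local_table(p) for p in partitions))
joint, pmi, entropy, chi2 = derived_statistics(merged)
```

In this sketch the size of each partial table depends on the number of distinct pairs seen, which is the communication cost the abstract contrasts with moment-based statistics. When nearly every observation is a distinct pair, the merged table approaches the size of the data itself.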
Keywords :
probability; statistical analysis; contingency statistics; contingency table; information entropy; marginal probability; moment-based statistics; numerical stability; parallel scalability; point-wise mutual information; statistical analysis; Algorithm design and analysis; Information entropy; Joints; Mutual information; Random variables; Scalability; Statistical analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2010 IEEE International Conference on
Conference_Location :
Heraklion, Crete
Print_ISBN :
978-1-4244-8373-0
Electronic_ISBN :
978-0-7695-4220-1
Type :
conf
DOI :
10.1109/CLUSTER.2010.43
Filename :
5600310