Title :
Design and Performance of a Scalable, Parallel Statistics Toolkit
Author :
Pébay, Philippe ; Thompson, David ; Bennett, Janine ; Mascarenhas, Ajith
Author_Institution :
Sandia Nat. Labs., Livermore, CA, USA
Abstract :
Most statistical software packages implement a broad range of techniques but do so in an ad hoc fashion, leaving users who do not have a broad knowledge of statistics at a disadvantage, since they may not understand all the implications of a given analysis or how to test the validity of results. These packages are also largely serial in nature, or target multicore architectures instead of distributed-memory systems, or provide only a small number of statistics in parallel. This paper surveys a collection of parallel implementations of statistics algorithms developed as part of a common framework over the last three years. The framework strategically groups modeling techniques with associated verification and validation techniques to make the underlying assumptions of the statistics clearer. Furthermore, it employs a design pattern specifically targeted for distributed-memory parallelism, where architectural advances in large-scale high-performance computing have been focused. Moment-based statistics (which include descriptive, correlative, and multicorrelative statistics, principal component analysis (PCA), and k-means statistics) scale nearly linearly with the data set size and number of processes. Entropy-based statistics (which include order and contingency statistics) do not scale well when the data in question is continuous or quasi-diffuse, but do scale well when the data is discrete and compact. We confirm and extend our earlier results by now establishing near-optimal scalability with up to 10,000 processes.
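The moment-based statistics described in the abstract are commonly parallelized with numerically stable pairwise update formulas, so that each process summarizes its local data in one pass and the partial summaries are then merged in a reduction. The sketch below illustrates this pattern in Python under stated assumptions: the MPI reduction is simulated with a sequential fold, and all names are illustrative rather than the toolkit's actual API.

```python
from functools import reduce

def summarize(chunk):
    """One-pass local summary of a data chunk: (count, mean, M2),
    where M2 is the sum of squared deviations from the mean."""
    n, mean, M2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        d = x - mean
        mean += d / n
        M2 += d * (x - mean)
    return (n, mean, M2)

def combine(a, b):
    """Merge two partial summaries into one, using the standard
    numerically stable pairwise update for distributed moments."""
    n_a, mean_a, M2_a = a
    n_b, mean_b, M2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta * delta * n_a * n_b / n
    return (n, mean, M2)

# Simulate 4 processes, each summarizing its local chunk, then
# reduce the partial results (as an MPI_Reduce with a custom
# operator would in a real distributed-memory run).
data = [float(i) for i in range(100)]
chunks = [data[i::4] for i in range(4)]
n, mean, M2 = reduce(combine, map(summarize, chunks))
variance = M2 / (n - 1)
print(n, mean, variance)
```

Because `combine` is associative, the same merge can be applied in any reduction-tree order, which is what makes the computation scale across processes without a serial gather of raw data.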
Keywords :
distributed memory systems; formal specification; mathematics computing; object-oriented programming; parallel programming; principal component analysis; program verification; software packages; statistical analysis; contingency statistics; descriptive statistics; design pattern; distributed-memory parallelism; entropy-based statistics; k-means statistics; large-scale high-performance computing; modeling technique; moment-based statistics; multicore architecture; multicorrelative statistics; parallel implementation; principal component analysis; scalable parallel statistics toolkit; statistical software package; statistics algorithm; validation technique; verification technique; Algorithm design and analysis; Computational modeling; Data models; Parallel processing; Principal component analysis; Scalability;
Conference_Title :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.293