Title :
Design and Performance of a Scalable, Parallel Statistics Toolkit
Author :
Pébay, Philippe ; Thompson, David ; Bennett, Janine ; Mascarenhas, Ajith
Author_Institution :
Sandia Nat. Labs., Livermore, CA, USA
Abstract :
Most statistical software packages implement a broad range of techniques but do so in an ad hoc fashion, leaving users who do not have a broad knowledge of statistics at a disadvantage, since they may not understand all the implications of a given analysis or how to test the validity of results. These packages are also largely serial in nature, or target multicore architectures instead of distributed-memory systems, or provide only a small number of statistics in parallel. This paper surveys a collection of parallel implementations of statistics algorithms developed as part of a common framework over the last three years. The framework strategically groups modeling techniques with associated verification and validation techniques to make the underlying assumptions of the statistics clearer. Furthermore, it employs a design pattern specifically targeted for distributed-memory parallelism, where architectural advances in large-scale high-performance computing have been focused. Moment-based statistics (which include descriptive, correlative, and multicorrelative statistics, principal component analysis (PCA), and k-means statistics) scale nearly linearly with the data set size and number of processes. Entropy-based statistics (which include order and contingency statistics) do not scale well when the data in question is continuous or quasi-diffuse, but do scale well when the data is discrete and compact. We confirm and extend our earlier results by now establishing near-optimal scalability with up to 10,000 processes.
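The moment-based statistics described in the abstract are commonly parallelized with numerically stable pairwise update formulas, so that each process summarizes its local data in one pass and the partial summaries are then merged in a reduction. The sketch below illustrates this pattern in Python under stated assumptions: the MPI reduction is simulated with a sequential fold, and all names are illustrative rather than the toolkit's actual API.

```python
from functools import reduce

def summarize(chunk):
    """One-pass local summary of a data chunk: (count, mean, M2),
    where M2 is the sum of squared deviations from the mean."""
    n, mean, M2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        d = x - mean
        mean += d / n
        M2 += d * (x - mean)
    return (n, mean, M2)

def combine(a, b):
    """Merge two partial summaries into one, using the standard
    numerically stable pairwise update for distributed moments."""
    n_a, mean_a, M2_a = a
    n_b, mean_b, M2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta * delta * n_a * n_b / n
    return (n, mean, M2)

# Simulate 4 processes, each summarizing its local chunk, then
# reduce the partial results (as an MPI_Reduce with a custom
# operator would in a real distributed-memory run).
data = [float(i) for i in range(100)]
chunks = [data[i::4] for i in range(4)]
n, mean, M2 = reduce(combine, map(summarize, chunks))
variance = M2 / (n - 1)
print(n, mean, variance)
```

Because `combine` is associative, the same merge can be applied in any reduction-tree order, which is what makes the computation scale across processes without a serial gather of raw data.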
Keywords :
distributed memory systems; formal specification; mathematics computing; object-oriented programming; parallel programming; principal component analysis; program verification; software packages; statistical analysis; contingency statistics; descriptive statistics; design pattern; distributed-memory parallelism; entropy-based statistics; k-means statistics; large-scale high-performance computing; modeling technique; moment-based statistics; multicore architecture; multicorrelative statistics; parallel implementation; principal component analysis; scalable parallel statistics toolkit; statistical software package; statistics algorithm; validation technique; verification technique; Algorithm design and analysis; Computational modeling; Data models; Parallel processing; Principal component analysis; Scalability;
Conference_Title :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.293