Regional Information Center for Science and Technology - Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

DocumentCode :

560162

Title :

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Author :

Williams, Samuel ; Oliker, Leonid ; Carter, Jonathan ; Shalf, John

Author_Institution :

Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA

fYear :

2011

fDate :

12-18 Nov. 2011

Firstpage :

Lastpage :

Abstract :

We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.

Keywords :

Cray computers; application program interfaces; grid computing; lattice Boltzmann methods; magnetohydrodynamics; message passing; multi-threading; multiprocessing systems; parallel architectures; Cray XE6; HPC node architectures; HPC systems; IBM BlueGene/P platforms; ISA-specific intrinsics; OpenMP; Pthreads; communication bottlenecks; cooling constraints; distributed auto-tuning; energy requirements; flat MPI; grid-based lattice Boltzmann computation; hierarchical auto-tuning; hierarchical tuning techniques; homogeneous isotropic turbulence; large-scale simulations; loop transformations; magnetohydrodynamics; microprocessor clock speeds; multicore-based supercomputing platforms; on-chip parallelism; parallel optimization; portable optimization methodology; power constraints; programming model exploration; sophisticated sequential auto-tuning techniques; state-of-the-art Cray XT4; thread decomposition strategy; tuning approach; ultrascale lattice Boltzmann performance; virtual vectorization; Distribution functions; Lattices; Multicore processing; Optimization; Three dimensional displays; Tuning; Vectors; Auto-tuning; BlueGene; Hybrid Programming Models; Lattice Boltzmann; OpenMP; SIMD;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for

Conference_Location :

Seattle, WA

Electronic_ISBN :

978-1-4503-0771-0

Type :

conf

Filename :

6114428

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=560162