DocumentCode :
560162
Title :
Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
Author :
Williams, Samuel ; Oliker, Leonid ; Carter, Jonathan ; Shalf, John
Author_Institution :
Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
fYear :
2011
fDate :
12-18 Nov. 2011
Firstpage :
1
Lastpage :
12
Abstract :
We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.
Keywords :
Cray computers; application program interfaces; grid computing; lattice Boltzmann methods; magnetohydrodynamics; message passing; multi-threading; multiprocessing systems; parallel architectures; Cray XE6; HPC node architectures; HPC systems; IBM BlueGene/P platforms; ISA-specific intrinsics; OpenMP; Pthreads; communication bottlenecks; cooling constraints; distributed auto-tuning; energy requirements; flat MPI; grid-based lattice Boltzmann computation; hierarchical auto-tuning; hierarchical tuning techniques; homogeneous isotropic turbulence; large-scale simulations; loop transformations; magnetohydrodynamics; microprocessor clock speeds; multicore-based supercomputing platforms; on-chip parallelism; parallel optimization; portable optimization methodology; power constraints; programming model exploration; sophisticated sequential auto-tuning techniques; state-of-the-art Cray XT4; thread decomposition strategy; tuning approach; ultrascale lattice Boltzmann performance; virtual vectorization; Distribution functions; Lattices; Multicore processing; Optimization; Three dimensional displays; Tuning; Vectors; Auto-tuning; BlueGene; Hybrid Programming Models; Lattice Boltzmann; OpenMP; SIMD;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0
Type :
conf
Filename :
6114428