Regional Information Center for Science and Technology (RICeST) - Scaling and analyzing the stencil performance on multi-core and many-core architectures

DocumentCode :

3588628

Title :

Scaling and analyzing the stencil performance on multi-core and many-core architectures

Author :

Lin Gan ; Haohuan Fu ; Wei Xue ; Yangtong Xu ; Chao Yang ; Xinliang Wang ; Zihong Lv ; Yang You ; Guangwen Yang ; Kaijian Ou

Author_Institution :

Minist. of Educ. Key Lab. for Earth Syst. Modeling, Tsinghua Univ., Beijing, China

Year :

2014

Firstpage :

103

Lastpage :

110

Abstract :

Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchies and the different mechanisms for issuing and executing parallel instructions lead to different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck to achieving high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.
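
For readers unfamiliar with the term, the basic 7-point Jacobi stencil named in the abstract can be sketched as below. This is a minimal, unoptimized illustration and not the authors' implementation: the function names `jacobi7_step` and `make_grid`, the nested-list grid layout, and the weights `c0` and `c1` are all assumptions chosen for the example.

```python
def make_grid(n, value=0.0):
    """Build an n x n x n grid as nested lists, indexed grid[z][y][x] (hypothetical helper)."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


def jacobi7_step(u, c0=1.0 / 7.0, c1=1.0 / 7.0):
    """Apply one Jacobi sweep of a 7-point stencil to the interior of a 3D grid.

    c0 weights the centre point and c1 weights each of the six face
    neighbours. Both values are illustrative, not taken from the paper.
    Boundary cells are copied unchanged, and the input grid is not modified.
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    # Jacobi iteration reads only from the old grid and writes to a new one.
    out = [[row[:] for row in plane] for plane in u]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[z][y][x] = c0 * u[z][y][x] + c1 * (
                    u[z - 1][y][x] + u[z + 1][y][x]
                    + u[z][y - 1][x] + u[z][y + 1][x]
                    + u[z][y][x - 1] + u[z][y][x + 1]
                )
    return out


if __name__ == "__main__":
    grid = make_grid(4, 1.0)
    updated = jacobi7_step(grid)
    print(updated[1][1][1])
```

Each interior point is updated from itself and its six face neighbours, so every sweep touches seven input values per output value. That low ratio of arithmetic to memory traffic is why the abstract emphasizes cache use, vectorization, and memory alignment.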

Keywords :

graphics processing units; multiprocessing systems; program compilers; Intel Sandy Bridge processor; Intel Xeon Phi coprocessor; Kepler K20x GPU; NVIDIA Fermi C2070 GPU; WNAD stencil; compiler; graphics processing unit; many-core architecture; multicore architecture; multithreading; numerical simulation; optimization; seven-point Jacobi stencil; stencil performance; vectorization; Computer architecture; Graphics processing units; Instruction sets; Kernel; Microwave integrated circuits; Optimization; Registers; Many-core architecture; Multi-core architecture; Optimizations; Stencil;

Language :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on

Type :

conf

DOI :

10.1109/PADSW.2014.7097797

Filename :

7097797

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3588628