DocumentCode :
3588628
Title :
Scaling and analyzing the stencil performance on multi-core and many-core architectures
Author :
Lin Gan ; Haohuan Fu ; Wei Xue ; Yangtong Xu ; Chao Yang ; Xinliang Wang ; Zihong Lv ; Yang You ; Guangwen Yang ; Kaijian Ou
Author_Institution :
Minist. of Educ. Key Lab. for Earth Syst. Modeling, Tsinghua Univ., Beijing, China
fYear :
2014
Firstpage :
103
Lastpage :
110
Abstract :
Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization for cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchies and the different mechanisms for issuing and executing parallel instructions lead to different performance behaviors on CPU, MIC, and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck to achieving high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.
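For readers unfamiliar with the basic 7-point Jacobi stencil named in the abstract, the following is a minimal, unoptimized C sketch of one sweep over a 3D grid. It is not the authors' implementation: the function name `jacobi7`, the helper `idx`, the coefficients `alpha` and `beta`, and the x-fastest memory layout are all assumptions chosen for illustration, and none of the multi-threading, vectorization, or cache optimizations discussed in the paper are applied.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx * ny * nz grid, with x varying fastest.
   (Hypothetical helper; the layout is an assumption.) */
static size_t idx(size_t i, size_t j, size_t k, size_t nx, size_t ny) {
    return i + nx * (j + ny * k);
}

/* One Jacobi sweep: every interior point of `out` is set to
   alpha * centre + beta * (sum of the six face neighbours), read from `in`.
   Boundary points of `out` are left untouched.
   alpha and beta are illustrative coefficients, not values from the paper. */
void jacobi7(const double *in, double *out,
             size_t nx, size_t ny, size_t nz,
             double alpha, double beta) {
    /* Jacobi iteration reads old values only, so the buffers must differ. */
    assert(in != out);

    for (size_t k = 1; k + 1 < nz; ++k) {
        for (size_t j = 1; j + 1 < ny; ++j) {
            for (size_t i = 1; i + 1 < nx; ++i) {
                double neighbours =
                    in[idx(i - 1, j, k, nx, ny)] + in[idx(i + 1, j, k, nx, ny)] +
                    in[idx(i, j - 1, k, nx, ny)] + in[idx(i, j + 1, k, nx, ny)] +
                    in[idx(i, j, k - 1, nx, ny)] + in[idx(i, j, k + 1, nx, ny)];
                out[idx(i, j, k, nx, ny)] =
                    alpha * in[idx(i, j, k, nx, ny)] + beta * neighbours;
            }
        }
    }
}
```

Each output point reads seven inputs but performs only a handful of floating-point operations, which is why the memory hierarchy and aligned vector accesses weigh so heavily on stencil performance.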
Keywords :
graphics processing units; multiprocessing systems; program compilers; Intel Sandy Bridge processor; Intel Xeon Phi coprocessor; Kepler K20x GPU; NVIDIA Fermi C2070 GPU; WNAD stencil; compiler; graphics processing unit; many-core architecture; multicore architecture; multithreading; numerical simulation; optimization; seven-point Jacobi stencil; stencil performance; vectorization; Computer architecture; Graphics processing units; Instruction sets; Kernel; Microwave integrated circuits; Optimization; Registers; Many-core architecture; Multi-core architecture; Optimizations; Stencil;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on
Type :
conf
DOI :
10.1109/PADSW.2014.7097797
Filename :
7097797