Regional Information Center for Science and Technology - Highly scalable implementation of an N-body code on a GPU cluster Original Research Article

Title of article :

Highly scalable implementation of an N-body code on a GPU cluster Original Research Article

Author/Authors :

Yohei Miki (author), Daisuke Takahashi (author), Masao Mori (author)

Issue Information :

Monthly, with serial issue number, year 2013

Pages :

From page :

2159

To page :

2168

Abstract :

We have developed a highly optimized code for collisionless N-body calculations based on direct summation. Our new optimization hides the global memory access latency, and the resulting CUDA code has a peak performance of 1006.7 GFlop/s in single precision (assuming 26 floating-point operations per interaction) with a single NVIDIA Tesla M2090 board. To improve the scalability of the OpenMP/MPI hybrid parallelized code, we have reduced the number of communications among multiple GPUs and have overlapped communications with computations to hide communication time. The code's performance was measured on HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences), a recently installed GPGPU cluster at the University of Tsukuba. The results show excellent scalability, with superlinear scaling when the number of N-body particles per GPU is less than 10^4 and parallel efficiency approaching unity when the number of N-body particles per GPU is greater than 10^4. The CUDA/OpenMP/MPI code has a peak performance of 255.5 TFlop/s when 256 NVIDIA Tesla M2090 boards are used, which is 75.0% of the theoretical peak performance.
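
The performance figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the article. It assumes that a direct-summation step over N particles evaluates N x N pairwise interactions, and that the single-precision theoretical peak of one Tesla M2090 board is about 1331 GFlop/s (the vendor figure; the abstract itself does not state it). The function names are hypothetical.

```python
# Flop-count convention from the abstract: 26 floating-point operations per interaction.
FLOPS_PER_INTERACTION = 26

# Assumption (not stated in the abstract): single-precision theoretical peak
# of one NVIDIA Tesla M2090 board, in GFlop/s.
M2090_PEAK_SP_GFLOPS = 1331.0


def direct_sum_flops(n_particles, flops_per_interaction=FLOPS_PER_INTERACTION):
    """Floating-point operations for one direct-summation force evaluation.

    Assumes all n * n ordered pairs are evaluated (self-pairs included),
    which is a common convention but an assumption here.
    """
    return flops_per_interaction * n_particles * n_particles


def sustained_gflops(n_particles, seconds_per_step):
    """Sustained performance in GFlop/s for a measured time per step."""
    return direct_sum_flops(n_particles) / seconds_per_step / 1.0e9


def fraction_of_peak(measured_gflops, n_gpus=1):
    """Measured performance as a fraction of the assumed theoretical peak."""
    return measured_gflops / (n_gpus * M2090_PEAK_SP_GFLOPS)


if __name__ == "__main__":
    # Single board: 1006.7 GFlop/s as reported in the abstract.
    print(f"1 GPU:    {fraction_of_peak(1006.7):.3f} of assumed peak")
    # 256 boards: 255.5 TFlop/s = 255500 GFlop/s as reported in the abstract.
    print(f"256 GPUs: {fraction_of_peak(255.5e3, n_gpus=256):.3f} of assumed peak")
```

Under these assumptions the 256-board figure comes out at roughly three quarters of the aggregate peak, which is consistent with the 75.0% stated in the abstract.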

Keywords :

Performance modeling , OpenMP/MPI , CUDA , Performance tuning , GPGPU , N-body calculation

Journal title :

Computer Physics Communications

Serial Year :

2013

Record number :

1136633

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=1136633