DocumentCode :
625652
Title :
Extending OpenSHMEM for GPU Computing
Author :
Potluri, Sreeram ; Bureddy, D. ; Wang, Huifang ; Subramoni, Hari ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
1001
Lastpage :
1012
Abstract :
Graphics Processing Units (GPUs) are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. In order to maximize utilization, it is imperative that applications running on these clusters have low synchronization and communication overheads. Partitioned Global Address Space (PGAS) models provide an attractive approach for developing parallel scientific applications. Such models simplify programming through the abstraction of a shared memory address space, while their one-sided communication primitives allow for efficient implementation of applications with minimal synchronization. OpenSHMEM is a library-based programming model that is gaining popularity. However, the current OpenSHMEM standard does not support direct communication from GPU device buffers: data must be copied to host memory before OpenSHMEM calls can be made, and remote processes must, in turn, explicitly move received data back to the GPU. This severely limits the programmability and performance of GPU applications. In this paper, we propose extensions to the OpenSHMEM model that allow communication calls to be made directly on GPU memory. The proposed extensions are interoperable with the two most popular GPU programming frameworks: CUDA and OpenCL. We present designs for an efficient OpenSHMEM runtime that transparently provides high-performance communication between GPUs in different intra-node and inter-node configurations. To the best of our knowledge, this is the first work that enables GPU-GPU communication using the OpenSHMEM model for both the CUDA and OpenCL computing frameworks. The proposed extensions to OpenSHMEM, coupled with the high-performance runtime, improve the latency of the GPU-GPU shmem_getmem operation by 90%, 40%, and 17% for intra-IOH (I/O Hub), inter-IOH, and inter-node configurations, respectively. They improve the performance of OpenSHMEM atomics by up to 55% and 52% for intra-IOH and inter-node GPU configurations, respectively. The proposed enhancements improve the performance of the Stencil2D kernel by 65% on a cluster of 192 GPUs and the performance of the BFS kernel by 12% on a cluster of 96 GPUs.
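To make the host-staging overhead described in the abstract concrete, the following C/CUDA sketch shows the copy-to-host pattern required by the standard OpenSHMEM model and marks, in a comment, where the paper's proposed extensions would remove the explicit copies. This is an illustrative sketch only: it assumes an OpenSHMEM 1.0-style API (start_pes, shmalloc, shmem_getmem) alongside the CUDA runtime, the buffer name and size are invented for the example, and the extended device-buffer API itself is not reproduced from the paper.

/* Illustrative sketch: host-staged GPU-GPU exchange with standard OpenSHMEM.
 * The proposed extensions in the paper would let shmem_getmem operate on
 * device-resident buffers directly; here that is only indicated in a comment. */
#include <shmem.h>
#include <cuda_runtime.h>

#define NBYTES (1 << 20)   /* example message size */

int main(void)
{
    start_pes(0);                        /* initialize OpenSHMEM */
    int me   = _my_pe();
    int peer = (me + 1) % _num_pes();
    (void) me;

    /* Symmetric heap buffer on the host (standard OpenSHMEM). */
    char *host_sym = (char *) shmalloc(NBYTES);

    /* Device buffer, e.g. produced by a CUDA kernel (allocation shown only). */
    char *dev_buf;
    cudaMalloc((void **) &dev_buf, NBYTES);

    /* Standard model: stage device data through host memory, as the
     * abstract notes, before any OpenSHMEM call can be made. */
    cudaMemcpy(host_sym, dev_buf, NBYTES, cudaMemcpyDeviceToHost);
    shmem_barrier_all();                 /* ensure the peer has staged its data */

    /* Host-to-host one-sided get of the peer's staged buffer. */
    shmem_getmem(host_sym, host_sym, NBYTES, peer);

    /* Move the received data back to the GPU explicitly. */
    cudaMemcpy(dev_buf, host_sym, NBYTES, cudaMemcpyHostToDevice);

    /* With the extensions proposed in the paper, the runtime could accept
     * device buffers in the communication call itself, eliminating both
     * cudaMemcpy calls and the host-side staging buffer. */

    shmem_barrier_all();
    cudaFree(dev_buf);
    shfree(host_sym);
    return 0;
}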
Keywords :
graphics processing units; parallel architectures; parallel machines; shared memory systems; CUDA; GPU computing; GPU device buffers; GPU memory; GPU programming frameworks; GPU-GPU shmem getmem operation; I/O hub; OpenCL; OpenCL computing frameworks; OpenSHMEM atomics; OpenSHMEM model; OpenSHMEM runtime; PGAS models; Stencil2D kernel; abstraction; communication overheads; direct communication; graphics processing units; high compute density; high-performance communication; high-performance runtime; host memory; inter-node configurations; interoperable; intra-IOH; intra-node configurations; library-based programming model; minimum synchronization; modern supercomputer architectures; one-sided communication primitives; parallel scientific applications; partitioned global address space models; performance per watt; programmability; shared memory address space; Computational modeling; Context; Electronics packaging; Graphics processing units; Kernel; Programming; Runtime; CUDA; GPU; OpenCL; OpenSHMEM; PGAS;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS)
Conference_Location :
Boston, MA
ISSN :
1530-2075
Print_ISBN :
978-1-4673-6066-1
Type :
conf
DOI :
10.1109/IPDPS.2013.104
Filename :
6569880