• DocumentCode
    625652
  • Title
    Extending OpenSHMEM for GPU Computing
  • Author
    Potluri, Sreeram ; Bureddy, D. ; Wang, Huifang ; Subramoni, Hari ; Panda, Dhabaleswar K.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    1001
  • Lastpage
    1012
  • Abstract
    Graphics Processing Units (GPUs) are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. In order to maximize utilization, it is imperative that applications running on these clusters have low synchronization and communication overheads. Partitioned Global Address Space (PGAS) models provide an attractive approach for developing parallel scientific applications. Such models simplify programming through the abstraction of a shared memory address space, while their one-sided communication primitives allow for efficient implementation of applications with minimal synchronization. OpenSHMEM is a library-based programming model that is gaining popularity. However, the current OpenSHMEM standard does not support direct communication from GPU device buffers. It requires data to be copied to host memory before OpenSHMEM calls can be made. Similarly, data has to be moved to the GPU explicitly by remote processes. This severely limits the programmability and performance of GPU applications. In this paper we provide extensions to the OpenSHMEM model that allow communication calls to be made directly on GPU memory. The proposed extensions are interoperable with the two most popular GPU programming frameworks: CUDA and OpenCL. We present designs for an efficient OpenSHMEM runtime that transparently provides high-performance communication between GPUs in different inter-node and intra-node configurations. To the best of our knowledge, this is the first work that enables GPU-GPU communication using the OpenSHMEM model for both the CUDA and OpenCL computing frameworks. The proposed extensions to OpenSHMEM, coupled with the high-performance runtime, improve the latency of the GPU-GPU shmem_getmem operation by 90%, 40% and 17% for intra-IOH (I/O Hub), inter-IOH and inter-node configurations, respectively. They improve the performance of OpenSHMEM atomics by up to 55% and 52% for intra-IOH and inter-node GPU configurations, respectively. The proposed enhancements improve the performance of the Stencil2D kernel by 65% on a cluster of 192 GPUs and the performance of the BFS kernel by 12% on a cluster of 96 GPUs.
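    A minimal C sketch to make the staging overhead described in the abstract concrete: under the unextended OpenSHMEM standard, GPU data must be copied through host-resident symmetric memory before and after a shmem_getmem, whereas the proposed extensions would let the same call operate on device buffers directly (shown only as a comment). Only public OpenSHMEM 1.2 and CUDA runtime calls are used; the buffer size and the direct-call form are illustrative assumptions, not the paper's actual interface.

        /* Host-staged GPU-to-GPU get under the unextended OpenSHMEM standard. */
        #include <shmem.h>
        #include <cuda_runtime.h>
        #include <stdlib.h>

        #define NBYTES (1 << 20)   /* illustrative transfer size */

        int main(void) {
            shmem_init();
            int peer = (shmem_my_pe() + 1) % shmem_n_pes();

            /* Device buffers produced/consumed by GPU kernels (kernels omitted). */
            void *d_src, *d_dst;
            cudaMalloc(&d_src, NBYTES);
            cudaMalloc(&d_dst, NBYTES);

            /* Communication buffers must live in host-side symmetric memory,
               so GPU data is staged through the host on both ends. */
            void *sym = shmem_malloc(NBYTES);
            void *host_tmp = malloc(NBYTES);

            cudaMemcpy(sym, d_src, NBYTES, cudaMemcpyDeviceToHost);      /* stage out */
            shmem_barrier_all();               /* ensure peer's staging copy is done */
            shmem_getmem(host_tmp, sym, NBYTES, peer);         /* host-to-host get */
            cudaMemcpy(d_dst, host_tmp, NBYTES, cudaMemcpyHostToDevice); /* stage in */

            /* With the extensions described in the abstract, the two staging
               copies disappear and the get operates on device memory directly,
               e.g. (illustrative only, assuming device buffers can be made
               symmetric under the proposed model):
                   shmem_getmem(d_dst, d_src, NBYTES, peer);                        */

            free(host_tmp);
            shmem_free(sym);
            cudaFree(d_src);
            cudaFree(d_dst);
            shmem_finalize();
            return 0;
        }

    An analogous staging pattern (e.g. via clEnqueueReadBuffer/clEnqueueWriteBuffer) applies when the device buffers are OpenCL cl_mem objects; the abstract states that the proposed runtime removes this staging transparently for both CUDA and OpenCL.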
  • Keywords
    graphics processing units; parallel architectures; parallel machines; shared memory systems; CUDA; GPU computing; GPU device buffers; GPU memory; GPU programming frameworks; GPU-GPU shmem_getmem operation; I/O hub; OpenCL; OpenCL computing frameworks; OpenSHMEM atomics; OpenSHMEM model; OpenSHMEM runtime; PGAS models; Stencil2D kernel; abstraction; communication overheads; direct communication; graphics processing units; high compute density; high-performance communication; high-performance runtime; host memory; inter-node configurations; interoperable; intra-IOH; intra-node configurations; library-based programming model; minimum synchronization; modern supercomputer architectures; one-sided communication primitives; parallel scientific applications; partitioned global address space models; performance per watt; programmability; shared memory address space; Computational modeling; Context; Electronics packaging; Graphics processing units; Kernel; Programming; Runtime; CUDA; GPU; OpenCL; OpenSHMEM; PGAS
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS)
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2013.104
  • Filename
    6569880