• DocumentCode
    1925582
  • Title
    Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
  • Author
    Jenkins, John; Dinan, James; Balaji, Pavan; Samatova, Nagiza F.; Thakur, Rajeev
  • Author_Institution
    Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
  • fYear
    2012
  • fDate
    24-28 Sept. 2012
  • Firstpage
    468
  • Lastpage
    476
  • Abstract
    Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of utilizing datatypes for noncontiguous communication of data in GPU memory. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-amenable format. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of datatypes that do not have a direct CUDA equivalent. These improvements are demonstrated to translate to significant improvements in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.
  • Keywords
    graphics processing units; message passing; parallel architectures; scientific information systems; vectors; 3D array slices; 4D array subvolumes; CUDA; GPU acceleration; GPU kernel; GPU memory; GPU-amenable format; MPI datatype-processing system; arbitrary datatype processing; communication pattern evaluation; communication pattern identification; data-processing strategies; datatype packing; datatype representations; end-to-end GPU-to-GPU communication time; fast noncontiguous GPU data movement; fine-grained element-level parallelism; hybrid MPI+GPU environments; in-device packing; in-device unpacking; large-scale scientific computations; noncontiguous column vectors; noncontiguous data communication; noncontiguous data transfer; overhead; performance improvement; resource contention; Arrays; Encoding; Graphics processing unit; Instruction sets; Kernel; Parallel processing; Vectors; Datatype; GPU; MPI
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2012 IEEE International Conference on Cluster Computing (CLUSTER)
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2422-9
  • Type
    conf
  • DOI
    10.1109/CLUSTER.2012.72
  • Filename
    6337810