• DocumentCode
    168684
  • Title
    Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs
  • Author
    Oden, Lena; Klenk, Benjamin; Fröning, Holger
  • Author_Institution
    Competence Center High Performance Comput., Fraunhofer Inst. for Ind. Math., Kaiserslautern, Germany
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    483
  • Lastpage
    492
  • Abstract
    GPUs have gained high popularity in High Performance Computing due to their massive parallelism and high performance per Watt. Despite this popularity, data transfer between multiple GPUs in a cluster remains a problem: most communication models require the CPU to control the data flow, and intermediate staging copies to host memory are often inevitable. Both facts lead to higher CPU and memory utilization; as a result, overall performance decreases and power consumption increases. Collective operations like reduce and allreduce are very common in scientific simulations and are also very sensitive to performance. Due to their massive parallelism, GPUs are well suited for such operations, but they only excel in performance if they can process the problem in-core. Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication for heterogeneous clusters that is completely in line with the GPU's thread-collective execution model and requires neither CPU assistance nor staging copies in host memory. As we will see, GGAS helps to process collective operations among distributed GPUs in-core. In this paper, we introduce the implementation and optimization of collective reduce and allreduce operations using GGAS as a communication model. Compared to message passing, we obtain a speedup of 1.7x for small data sizes. A detailed analysis based on power measurements of CPU, host memory, and GPU reveals that GGAS as a communication model not only saves cycles, but also dramatically reduces power and energy consumption. For instance, for an allreduce operation, half of the energy can be saved through the reduced power consumption in combination with the lower run time.
  • Keywords
    data flow computing; graphics processing units; multi-threading; power aware computing; storage management; workstation clusters; CPU; GGAS; collective operation processing; communication models; data flow control; data transfer; direct GPU-to-GPU communication; distributed GPUs in-core; energy consumption reduction; energy-efficient collective allreduce operation; energy-efficient collective reduce operation; global GPU address spaces; heterogeneous clusters; high performance computing; host memory; intermediate staging; memory utilization; power consumption reduction; power measurements; scientific simulations; thread-collective execution model; Bandwidth; Data transfer; Graphics processing units; Instruction sets; Message systems; Performance evaluation; Synchronization; Collective Operations; Data Transfer; Energy; GPUs; Global Address Space; Power
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • Conference_Location
    Chicago, IL
  • Type
    conf
  • DOI
    10.1109/CCGrid.2014.21
  • Filename
    6846484