• DocumentCode
    1902642
  • Title

    Convergence and scalarization for data-parallel architectures

  • Author

    Lee, Yunsup ; Krashinsky, Ronny ; Grover, Vinod ; Keckler, Stephen W. ; Asanovic, Krste

  • Author_Institution
    Univ. of California at Berkeley, Berkeley, CA, USA
  • fYear
    2013
  • fDate
    23-27 Feb. 2013
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
  • Keywords
    convergence; multi-threading; optimising compilers; parallel architectures; power aware computing; GPUs; application kernels; arbitrary control flow; automatic scalar operation detection; compile-time analysis; compiler algorithms; convergent execution; data access counts; data parallelism; data-parallel architecture convergence; data-parallel architecture scalarization; dynamic oracle; energy inefficiency; instruction dispatch; memory address counts; memory operations; parallel code; register file reads; register file writes; scalarizing compiler; threaded code; threaded programming model; throughput processors; vector architectures; Algorithm design and analysis; Computer architecture; Convergence; Graphics processing units; Instruction sets; Kernel; Registers; CUDA; GPU; Scalarization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Print_ISBN
    978-1-4673-5524-7
  • Type

    conf

  • DOI
    10.1109/CGO.2013.6494995
  • Filename
    6494995