Regional Information Center for Science and Technology (RICeST) - Microarchitectural performance characterization of irregular GPU kernels

DocumentCode :

186369

Title :

Microarchitectural performance characterization of irregular GPU kernels

Author :

O'Neil, Molly A. ; Burtscher, Martin

Author_Institution :

Dept. of Comput. Sci., Texas State Univ., San Marcos, TX, USA

fYear :

2014

fDate :

26-28 Oct. 2014

Firstpage :

130

Lastpage :

139

Abstract :

GPUs are increasingly being used to accelerate general-purpose applications, including applications with data-dependent, irregular memory access patterns and control flow. However, relatively little is known about the behavior of irregular GPU codes, and there has been minimal effort to quantify the ways in which they differ from regular GPGPU applications. We examine the behavior of a suite of optimized irregular CUDA applications on a cycle-accurate GPU simulator. We characterize the performance bottlenecks in each program and connect source code with microarchitectural characteristics. We also assess the impact of improvements in cache and DRAM bandwidth and latency and discuss the implications for GPU architecture design. We find that, while irregular graph codes exhibit significantly more underutilized execution cycles due to branch divergence, load imbalance, and synchronization overhead than regular programs, these factors contribute less to performance degradation than we expected. It appears that code optimizations are often able to effectively address these performance hurdles. Insufficient bandwidth and long memory latency are the biggest limiters of performance. Surprisingly, we find that applications with irregular memory access patterns are more sensitive to changes in L2 latency and bandwidth than DRAM latency and bandwidth.

Keywords :

DRAM chips; cache storage; graphics processing units; parallel architectures; source code (software); synchronisation; DRAM bandwidth improvement; DRAM latency; GPU architecture design; L2 bandwidth; L2 latency; branch divergence; cache improvement; code optimizations; control flow; cycle-accurate GPU simulator; data-dependent-irregular memory access patterns; general-purpose applications; irregular GPU code behavior; irregular GPU kernels; irregular graph codes; irregular memory access patterns; latency improvement; load imbalance; memory latency; microarchitectural performance characterization; optimized irregular CUDA applications; performance degradation; regular GPGPU applications; source code; synchronization overhead; underutilized execution cycles; Bandwidth; Benchmark testing; Graphics processing units; Hardware; Kernel; Pipelines; Random access memory;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Workload Characterization (IISWC), 2014 IEEE International Symposium on

Conference_Location :

Raleigh, NC

Print_ISBN :

978-1-4799-6452-9

Type :

conf

DOI :

10.1109/IISWC.2014.6983052

Filename :

6983052

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=186369