DocumentCode :
238572
Title :
Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads
Author :
Kyoshin Choo ; William Panlener ; Byunghyun Jang
Author_Institution :
Comput. & Inf. Sci., Univ. of Mississippi, Oxford, MS, USA
fYear :
2014
fDate :
24-27 June 2014
Firstpage :
189
Lastpage :
196
Abstract :
Processing elements such as CPUs and GPUs depend on cache technology to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share workloads with CPUs, good cache design and use become increasingly important. Although both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different: on CPUs, only a few threads access memory simultaneously, whereas on GPUs memory access contention among thousands of threads is significantly higher. Despite this difference in behavior, little research has investigated the behavior and performance of GPU caches in depth. In this paper, we present an extensive study on the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, Graphics Core Next (GCN) from AMD. Our study makes the following observations and improvements. First, we observe that the L1 vector data cache hit rate is substantially lower than that of CPU caches; the main culprit is compulsory misses caused by a lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks; this high contention remains a main performance barrier in the L2 data cache even though its hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance. Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, where multiple compute units share a single cache to exploit inter-workgroup locality and increase data reusability, and 2) clustered workgroup scheduling, where workgroups with consecutive IDs are assigned to the same compute unit.
Keywords :
cache storage; graphics processing units; scheduling; AMD; CPUs; GCN; GPU architectures; GPU cache behavior; GPU cache memory performance optimization; L1 vector data cache hit rate; cache design; cache technology; clustered workgroup scheduling; compute workloads; coprocessors; cycle-accurate ISA level GPU architectural simulator; data reusability; data reuse; general-purpose workloads; graphics core next; interworkgroup locality; load sharing; memory access contention; memory access patterns; memory traffic reduction; processing elements; processor memory subsystem performance gap; shared L1 vector data cache; shared L2 data cache; Benchmark testing; Computer architecture; Discrete cosine transforms; Graphics processing units; Hardware; Instruction sets; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2014 IEEE 13th International Symposium on Parallel and Distributed Computing (ISPDC)
Conference_Location :
Marseille, France
Print_ISBN :
978-1-4799-5918-1
Type :
conf
DOI :
10.1109/ISPDC.2014.29
Filename :
6900219