DocumentCode :
695228
Title :
Coordinated static and dynamic cache bypassing for GPUs
Author :
Xiaolong Xie ; Yun Liang ; Yu Wang ; Guangyu Sun ; Tao Wang
Author_Institution :
Center for Energy-Efficient Comput. & Applic., Peking Univ., Beijing, China
fYear :
2015
fDate :
7-11 Feb. 2015
Firstpage :
76
Lastpage :
88
Abstract :
The massively parallel architecture of graphics processing units (GPUs) boosts performance for a wide range of applications. Initially, GPUs employed only scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have added caches alongside scratchpad memory in new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise from excessive thread contention for cache resources. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that helps mitigate this contention. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile time, we identify through profiling the global loads that show strong preferences for caching or bypassing. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In the CUDA programming model, threads are grouped into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run time. We modulate at the thread-block level to avoid memory divergence problems. Our approach combines compile-time analysis, which determines the cache-or-bypass preference for each global load, with run-time management, which adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average 1.32X) performance speedup for a variety of GPU applications.
Keywords :
cache storage; graphics processing units; multi-threading; parallel architectures; CUDA programming model; GPUs; bypass preferences; cache resource contention problem; compile-time analysis; coordinated dynamic cache bypassing; coordinated static cache bypassing; dynamic bypassing technique; graphics processing units; memory divergence problems; on-chip memory; parallel architecture; run-time management; scratchpad memory; thread blocks; thread contention; Arrays; Graphics processing units; Instruction sets; Kernel; Pipelines; Synchronization; System-on-chip;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on
Conference_Location :
Burlingame, CA
Type :
conf
DOI :
10.1109/HPCA.2015.7056023
Filename :
7056023