Regional Information Center for Science and Technology - An Efficient Compiler Framework for Cache Bypassing on GPUs

DocumentCode :

3601953

Title :

An Efficient Compiler Framework for Cache Bypassing on GPUs

Author :

Yun Liang ; Xiaolong Xie ; Guangyu Sun ; Deming Chen

Author_Institution :

Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China

Volume :

Issue :

Year :

2015

Firstpage :

1677

Lastpage :

1690

Abstract :

Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches alongside scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or compiler can explicitly choose cache access or bypass for each global load instruction. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to utilize the configurable cache efficiently and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate our techniques into an automatic compiler framework that leverages the Parallel Thread Execution (PTX) instruction set architecture to enable cache bypassing for GPUs. Experimental evaluation on an NVIDIA GTX680 using a variety of applications demonstrates that, compared to cache-all and bypass-all solutions, our techniques improve performance by 4.6% to 13.1% for a 16 KB L1 cache.
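
The intuition behind selecting individual global loads for bypass can be illustrated with a toy model. The sketch below is not the paper's algorithm or its metrics; it is a minimal Python simulation under stated assumptions: a fully associative LRU cache, a made-up trace in which a hypothetical load "A" reuses two lines while a hypothetical load "B" streams through lines it never touches again, and bypassed loads that neither probe nor allocate in the cache. The names `simulate` and `build_trace` and all parameters are invented for illustration.

```python
from collections import OrderedDict


def simulate(trace, capacity, bypass=frozenset()):
    """Count hits in a fully associative LRU cache holding `capacity` lines.

    trace  : iterable of (load_id, line_address) pairs (hypothetical format)
    bypass : set of load_ids whose accesses skip the cache entirely
    """
    cache = OrderedDict()  # keys are line addresses, ordered oldest to newest
    hits = 0
    for load_id, line in trace:
        if load_id in bypass:
            continue  # assumption: a bypassed load neither probes nor allocates
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # refresh LRU position on a hit
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits


def build_trace(rounds=4, hot_lines=2, stream_per_round=4):
    """Made-up access pattern: load "A" reuses a few hot lines every round,
    load "B" touches fresh lines that are never reused."""
    trace = []
    next_stream = 1000
    for _ in range(rounds):
        for line in range(hot_lines):
            trace.append(("A", line))
        for _ in range(stream_per_round):
            trace.append(("B", next_stream))
            next_stream += 1
    return trace


trace = build_trace()
cache_all = simulate(trace, capacity=4)                # every load uses the cache
bypass_b = simulate(trace, capacity=4, bypass={"B"})   # streaming load bypasses
print("cache-all hits:", cache_all)
print("bypass-B hits: ", bypass_b)
```

In this toy trace the streaming load evicts the reused lines before they are touched again, so caching everything yields 0 hits, while bypassing only load "B" lets load "A" hit 6 times. A real GPU L1 cache is set associative and shared by many concurrent threads, so this shows only the general effect, not the magnitude reported in the abstract.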

Keywords :

cache storage; graphics processing units; instruction sets; parallel memories; program compilers; GPU cache utilization; GPU vendor; NVIDIA GTX680; automatic compiler framework; bypass-all solution; cache access; cache bypassing; cache performance; cache-all solution; computing power; configurable cache; general purpose GPU application; general purpose application; global load instruction; graphics processing unit (GPU); irregular memory access; memory traffic; on-chip memory; parallel thread execution instruction set architecture; performance metrics; scratchpad memory; shared memory design space; Computer architecture; Instruments; Kernel; Optimization; System-on-chip; Compiler; Performance

Language :

English

Journal_Title :

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Publisher :

IEEE

ISSN :

0278-0070

Type :

jour

DOI :

10.1109/TCAD.2015.2424962

Filename :

7090987

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3601953