Regional Information Center for Science and Technology - Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

DocumentCode :

2996423

Title :

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Author :

Wu, Haicheng ; Diamos, Gregory ; Wang, Jin ; Cadambi, Srihari ; Yalamanchili, Sudhakar ; Chakradhar, Srimat

Author_Institution :

Sch. of ECE, Georgia Inst. of Technol., Atlanta, GA, USA

Year :

2012

Date :

21-25 May 2012

Firstpage :

2433

Lastpage :

2442

Abstract :

Data warehousing applications represent an emergent application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high core count architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes a set of compiler optimizations to address these challenges. Inspired in part by loop fusion/fission optimizations in the scientific computing community, we propose kernel fusion and kernel fission. Kernel fusion fuses the code bodies of two GPU kernels to i) eliminate redundant operations across dependent kernels, ii) reduce data movement between GPU registers and GPU memory, iii) reduce data movement between GPU memory and CPU memory, and iv) improve spatial and temporal locality of memory references. Kernel fission partitions a kernel into segments such that segment computations and data transfers between the GPU and host CPU can be overlapped. Fusion and fission can also be applied concurrently to a set of kernels. We empirically evaluate the benefits of fusion/fission on relational algebra operators drawn from the TPC-H benchmark suite. All kernels are implemented in CUDA and the experiments are performed with NVIDIA Fermi GPUs. In general, we observed data throughput improvements ranging from 13.1% to 41.4% for the SELECT operator and queries Q1 and Q21 in the TPC-H benchmark suite. We present key insights, lessons learned, and opportunities for further improvements.
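
The fusion and fission ideas described in the abstract can be illustrated with a minimal CPU-side sketch in Python. This is not the authors' CUDA implementation: the function names (`select`, `unfused`, `fused`, `fissioned`), the integer rows, and the two predicates are all hypothetical, chosen only to show the shape of each transformation. Fusion here means applying two dependent SELECT predicates in a single pass without materializing the intermediate result. Fission means splitting the input into segments that are processed independently. On a GPU, each segment's transfer could be overlapped with computation on another segment, whereas this sketch simply runs them one after another.

```python
from typing import Callable, List, Sequence

Predicate = Callable[[int], bool]


def select(rows: Sequence[int], pred: Predicate) -> List[int]:
    """One 'kernel': a relational SELECT that materializes its full output."""
    return [r for r in rows if pred(r)]


def unfused(rows: Sequence[int], p1: Predicate, p2: Predicate) -> List[int]:
    """Two dependent kernels: the intermediate result is written, then re-read."""
    intermediate = select(rows, p1)
    return select(intermediate, p2)


def fused(rows: Sequence[int], p1: Predicate, p2: Predicate) -> List[int]:
    """One fused kernel: both predicates are applied per row in a single pass,
    so no intermediate result is stored."""
    return [r for r in rows if p1(r) and p2(r)]


def fissioned(rows: Sequence[int], p1: Predicate, p2: Predicate,
              segment_size: int) -> List[int]:
    """Fission sketch: partition the input into segments and run the fused
    kernel on each. Segments run sequentially here; on a GPU their transfers
    and computations could be overlapped."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    out: List[int] = []
    for start in range(0, len(rows), segment_size):
        out.extend(fused(rows[start:start + segment_size], p1, p2))
    return out


# Hypothetical sample data and predicates, for illustration only.
SAMPLE_ROWS = list(range(20))


def is_even(r: int) -> bool:
    return r % 2 == 0


def greater_than_ten(r: int) -> bool:
    return r > 10
```

All three variants return the same rows for the same input; they differ only in how much intermediate data is materialized and in how the work is partitioned.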

Keywords :

data warehouses; graphics processing units; optimising compilers; parallel architectures; query processing; relational algebra; storage management; CPU memory; CUDA; GPU memory; GPU registers; NVIDIA Fermi GPU; TPC-H benchmark suite; compiler optimizations; data movement reduction; data throughput improvements; data transfers; data warehousing applications; general purpose GPU; graphics processing unit; kernel fission; kernel fusion; loop fission optimization; loop fusion optimization; memory reference spatial locality; memory reference temporal locality; redundant operation elimination; relational algebra operators; relational computation processing; relational query processing; scientific computing community; segment computations; Bandwidth; Graphics processing unit; Kernel; Memory management; Optimization; Throughput; Warehousing; GPU; compiler; data warehousing; optimization; relational algebra;

Language :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International

Conference_Location :

Shanghai

Print_ISBN :

978-1-4673-0974-5

Type :

conf

DOI :

10.1109/IPDPSW.2012.300

Filename :

6270615

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2996423