A Linear Performance-Breakdown Model for GPU Programming Optimization Guidance

Author

Chapa M, Mario A. ; Hiroyuki, Sato

Author_Institution

Dept. of Electr. Eng. & Inf. Sci., Univ. of Tokyo, Tokyo, Japan

fYear

2014

fDate

19-23 May 2014

Firstpage

596

Lastpage

603

Abstract

Graphics Processing Units (GPUs) are used as computing accelerators. Nevertheless, writing efficient GPU programs is a difficult and time-consuming task. In this paper we present the Linear Performance Breakdown Model (LBPM), an analytic model that breaks the execution time of a GPU kernel program down into the three major components that affect its running time. The model can be used as a tool that provides guidelines for detecting performance bottlenecks. Our approach incorporates three elements: the Global-to-Shared Memory Time slice, the Shared-to-Private Time slice and the Processing Units Time slice. These three factors are integrated into a performance model formula by applying the Normalized Least Squares Method (NLSM). The resulting coefficients are used to construct a performance breakdown graph that reveals the effect of each element on the total execution time of the kernel program. We demonstrate the proposed model on two common numeric routines, Single-Precision General Matrix Multiplication (SGMM) and Fast Fourier Transform (FFT), and apply it to results obtained from two GPU devices: an A8-3870 AMD Accelerated Processing Unit (APU) and a GTX 660 Nvidia GPU.
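
The fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the model formula, so the sketch assumes that execution time is a linear combination of three per-run quantities (one per time slice) and that "normalized" means each equation is divided by the measured time, so that every run contributes relative rather than absolute error. The function names, the input layout and the sample numbers are all hypothetical.

```python
import numpy as np

def fit_breakdown(slices, times):
    """Fit one coefficient per time slice by normalized least squares.

    slices: (n_runs, 3) array, columns assumed to be Global-to-Shared,
            Shared-to-Private and Processing Units quantities per run.
    times:  (n_runs,) measured kernel execution times.
    """
    slices = np.asarray(slices, dtype=float)
    times = np.asarray(times, dtype=float)
    # Assumed normalization: divide each equation by its measured time,
    # so the target vector becomes all ones (relative-error fit).
    a = slices / times[:, None]
    b = np.ones_like(times)
    coeffs, *_ = np.linalg.lstsq(a, b, rcond=None)
    return coeffs

def breakdown(slices, coeffs):
    """Return each slice's share of the modeled time for every run."""
    contrib = np.asarray(slices, dtype=float) * coeffs
    return contrib / contrib.sum(axis=1, keepdims=True)

# Synthetic data (made up for illustration, not measurements from the paper).
sample_slices = np.array([
    [100.0, 20.0, 50.0],
    [200.0, 80.0, 60.0],
    [50.0, 10.0, 300.0],
    [400.0, 100.0, 100.0],
])
true_coeffs = np.array([2.0, 0.5, 1.0])
sample_times = sample_slices @ true_coeffs

coeffs = fit_breakdown(sample_slices, sample_times)
shares = breakdown(sample_slices, coeffs)
```

Each row of `shares` gives the fraction of modeled time attributed to the three slices for one run, which is the kind of data a stacked-bar performance breakdown graph would plot.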

Keywords

fast Fourier transforms; graph theory; graphics processing units; least squares approximations; matrix multiplication; shared memory systems; software performance evaluation; A8-3870 AMD accelerated processing unit; APU; FFT; GPU devices; GPU kernel program execution; GPU programming optimization guidance; GTX 660 Nvidia GPU; LBPM; NLSM; SGMM; analytic model; computing accelerators; fast Fourier transform; global-to-shared memory time slice; graphic processing units; kernel program; linear performance-breakdown model; normalized least squares method; performance breakdown graph; processing unit time slice; shared-to-private time slice; single-precision general matrix multiplication; time consuming task; Computational modeling; Computer architecture; Graphics processing units; Kernel; Performance evaluation; Programming; Registers; GPGPU; Modeling; OpenCL;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International

Conference_Location

Phoenix, AZ

Print_ISBN

978-1-4799-4117-9

Type

conf

DOI

10.1109/IPDPSW.2014.70

Filename

6969440