DocumentCode :
704176
Title :
Optimality of Fundamental Parallel Algorithms on the Hierarchical Memory Machine, with GPU Implementation
Author :
Nakano, Koji ; Ito, Yasuaki
Author_Institution :
Dept. of Inf. Eng., Hiroshima Univ., Higashi-Hiroshima, Japan
fYear :
2015
fDate :
4-6 March 2015
Firstpage :
626
Lastpage :
634
Abstract :
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of CUDA-enabled GPU architectures. It consists of multiple streaming multiprocessors, each with its own shared memory, and a global memory that can be accessed by all threads. The HMM has several parameters: the number d of streaming multiprocessors, the number p of threads per streaming multiprocessor, the number w of memory banks in each shared memory and in the global memory, the shared memory latency l, and the global memory latency L. The main purpose of this paper is to discuss the optimality of fundamental parallel algorithms running on the HMM. We first show that the convolution of an image with n × n pixels and a filter of size (2v+1) × (2v+1) can be computed in O(n²/w + n²L/(dp) + n²v²/(dw) + n²v²l/(dp)) time units on the HMM. Further, we show that this parallel implementation is time optimal by proving a matching lower bound on the running time. We then show that the product of two n × n matrices can be computed in O(n³/(mw) + n³L/(mdp) + n³/(dw) + n³l/(dp)) time units on the HMM if the capacity of the shared memory in each streaming multiprocessor is O(m²); this implementation is also proved to be time optimal. We further clarify the conditions under which image convolution and matrix multiplication hide the memory access latency overhead and maximize the global memory throughput and the parallelism. Finally, we provide experimental results on a GeForce GTX Titan to support our theoretical analysis.
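The abstract's two running-time bounds can be sanity-checked numerically. A minimal sketch follows; the function names and all parameter values are illustrative assumptions, not taken from the paper:

```python
# Evaluate the HMM time bounds stated in the abstract.
# All concrete parameter values below are assumptions for demonstration only.

def convolution_time(n, v, d, p, w, l, L):
    """O(n^2/w + n^2*L/(d*p) + n^2*v^2/(d*w) + n^2*v^2*l/(d*p)) time units
    for convolving an n x n image with a (2v+1) x (2v+1) filter."""
    return n*n/w + n*n*L/(d*p) + n*n*v*v/(d*w) + n*n*v*v*l/(d*p)

def matmul_time(n, m, d, p, w, l, L):
    """O(n^3/(m*w) + n^3*L/(m*d*p) + n^3/(d*w) + n^3*l/(d*p)) time units
    for multiplying two n x n matrices, assuming each streaming
    multiprocessor's shared memory holds O(m^2) words."""
    return n**3/(m*w) + n**3*L/(m*d*p) + n**3/(d*w) + n**3*l/(d*p)

# Example parameters loosely modeled on a GPU-like configuration (assumed):
d, p, w, l, L = 14, 1024, 32, 1, 100
print(convolution_time(n=4096, v=3, d=d, p=p, w=w, l=l, L=L))
print(matmul_time(n=1024, m=64, d=d, p=p, w=w, l=l, L=L))
```

Comparing the relative magnitudes of the four terms in each bound shows which one dominates for a given parameter choice, mirroring the paper's discussion of when latency is hidden and throughput is maximized.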
Keywords :
computational complexity; convolution; graphics processing units; matrix multiplication; parallel algorithms; parallel architectures; shared memory systems; CUDA-enabled GPU architecture; GeForce GTX Titan; fundamental parallel algorithms; global memory latency; global memory throughput maximization; hierarchical memory machine; image convolution; matrix multiplication; memory access latency overhead hiding; multiple streaming multiprocessors; running time; shared memory latency; theoretical parallel computing model; CUDA; GPU; image convolution; matrix multiplication; memory machine models; parallel algorithms;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
Conference_Location :
Turku
ISSN :
1066-6192
Type :
conf
DOI :
10.1109/PDP.2015.46
Filename :
7092785