مرکز منطقه ای اطلاع رساني علوم و فناوري - A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment

DocumentCode :

743984

Title :

A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment

Author :

Sang-Won Ha ; Tack-Don Han

Author_Institution :

Dept. of Comput. Sci., Yonsei Univ., Seoul, South Korea

Volume :

Issue :

fYear :

2013

Firstpage :

2324

Lastpage :

2333

Abstract :

The parallel scan is a basic tool that is used to parallelize algorithms which appear to have serial dependencies. The performance of these algorithms relies heavily on the efficiency of the parallel scan that is being used. To maintain work efficiency, current parallelization methods either sacrifice the overall depth or limit the scalability. In this study, we present a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both a work-efficient and a depth-optimal process. In this method, the depth is increased by a small constant value above the lower bound; therefore, the amount of computation and/or memory access is effectively reduced. We also employ a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA. The proposed method facilitates the low-latency interthread accessible shared memory and the single-instruction-multiple-thread (SIMT) characteristics of the graphics hardware to reduce high-latency global memory access and costly barrier synchronization. Our experimental results demonstrate an average speed up of approximately 40 and 10 percent over the CUDA data parallel primitives (CUDPP) library derivation of the Kogge-Stone prefix tree and an implementation of Merrill and Grimshaw´s method with coarser combination of the Kogge-Stone graph and the Brent-Kung prefix graph, respectively.

Keywords :

graphics processing units; instruction sets; multi-threading; parallel architectures; shared memory systems; storage management; synchronisation; trees (mathematics); Brent-Kung prefix graph; CUDA data parallel primitives; CUDA environment; CUDPP library; Compute Unified Device Architecture; GPGPU environment; Han-Carlson parallel prefix graph; Kogge-Stone graph; Kogge-Stone prefix tree; Merrill-Grimshaw method; NVIDIA; SIMT characteristics; SPMD nature; algorithm parallelization; barrier synchronization; cascaded thread-block execution method; depth-optimal parallel scan; depth-optimal process; general-purpose graphics processing unit; graphics hardware; high-latency global memory access; low-latency interthread accessible shared memory; scalable work-efficient parallel scan; serial dependency; single-instruction-multiple-thread characteristics; single-program-multiple-data nature; work-efficient process; Algorithm design and analysis; Complexity theory; Graphics processing units; Instruction sets; GPGPU; Han-Carlson adder; Parallel scan; high-performance computing; prefix sum;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

ieee

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2012.336

Filename :

6392164

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=743984