Regional Information Center for Science and Technology - Acceleration of Bilateral Filtering Algorithm for Manycore and Multicore Architectures

DocumentCode :

1920257

Title :

Acceleration of Bilateral Filtering Algorithm for Manycore and Multicore Architectures

Author :

Agarwal, Dinesh ; Wilf, Sami ; Dhungel, Abinashi ; Prasad, Sushil K.

Author_Institution :

Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA

fYear :

2012

fDate :

10-13 Sept. 2012

Firstpage :

Lastpage :

Abstract :

Bilateral filtering is a ubiquitous tool for several kinds of image processing applications. This work explores multicore and manycore accelerations for the embarrassingly parallel yet compute-intensive bilateral filtering kernel. For manycore architectures, we have created a novel pair-symmetric algorithm to avoid redundant calculations. For multicore architectures, we improve the algorithm by using low-level single instruction multiple data (SIMD) parallelism across multiple threads. We propose architecture-specific optimizations, such as exploiting the unique capabilities of special registers available in modern multicore architectures and rearranging data access patterns to match the computations so that special-purpose instructions can be used. We also propose optimizations pertinent to Nvidia's Compute Unified Device Architecture (CUDA), including utilization of CUDA's implicit synchronization capability and the maximization of single-instruction-multiple-thread efficiency. We present empirical data on the performance gains we achieved over a variety of hardware architectures including Nvidia GTX 280, AMD Barcelona, AMD Shanghai, Intel Harpertown, AMD Phenom, Intel Core i7 quad-core, and Intel Nehalem 32-core machines. The best performance achieved was (i) a 169-fold speedup by the CUDA-based implementation of our pair-symmetric algorithm running on Nvidia's GTX 280 GPU compared to the compiler-optimized sequential code on Intel Core i7, and (ii) a 38-fold speedup using 16 cores of AMD Barcelona, each equipped with a 4-stage vector pipeline, compared to the compiler-optimized sequential code running on the same machine.
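
The abstract does not describe the pair-symmetric algorithm in detail. The following is only an illustrative sketch of the underlying observation, not the authors' implementation: the bilateral weight between two pixels is the same in both directions, so it can be computed once per unordered pair and applied to both pixels. The function names, the Gaussian form of the weights, and the parameters `radius`, `sigma_s`, and `sigma_r` are assumptions chosen for illustration.

```python
import math


def weight(img, y, x, yy, xx, sigma_s, sigma_r):
    """Gaussian spatial weight times Gaussian range weight (assumed form).

    The value is unchanged if (y, x) and (yy, xx) are swapped.
    """
    d2 = (yy - y) ** 2 + (xx - x) ** 2
    r2 = (img[yy][xx] - img[y][x]) ** 2
    return math.exp(-d2 / (2 * sigma_s ** 2) - r2 / (2 * sigma_r ** 2))


def bilateral_naive(img, radius, sigma_s, sigma_r):
    """Reference version: every pixel scans its full window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        wgt = weight(img, y, x, yy, xx, sigma_s, sigma_r)
                        num += wgt * img[yy][xx]
                        den += wgt
            out[y][x] = num / den
    return out


def bilateral_pair_symmetric(img, radius, sigma_s, sigma_r):
    """Sketch: visit each unordered pixel pair once and update both ends."""
    h, w = len(img), len(img[0])
    # Start with each pixel's contribution to itself (its weight is exactly 1).
    num = [[img[y][x] for x in range(w)] for y in range(h)]
    den = [[1.0] * w for _ in range(h)]
    # Half of the window: offsets to the right on the same row, plus all rows below.
    half = [(dy, dx)
            for dy in range(0, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy > 0 or dx > 0]
    for y in range(h):
        for x in range(w):
            for dy, dx in half:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    wgt = weight(img, y, x, yy, xx, sigma_s, sigma_r)
                    num[y][x] += wgt * img[yy][xx]
                    den[y][x] += wgt
                    num[yy][xx] += wgt * img[y][x]
                    den[yy][xx] += wgt
    return [[num[y][x] / den[y][x] for x in range(w)] for y in range(h)]
```

In this sequential sketch the weight function is evaluated roughly half as often as in the reference version. On a parallel device the two scattered updates would have to be coordinated to avoid races; how the paper handles that is not stated in the abstract.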

Keywords :

filtering theory; image processing; multi-threading; multiprocessing systems; parallel architectures; pipeline processing; program compilers; synchronisation; ubiquitous computing; AMD Barcelona; AMD Phenom; AMD Shanghai; Intel Core i7 quad core; Intel Harper town; Nvidia GTX 280; SIMD parallelism; architecture specific optimizations; bilateral filtering algorithm acceleration; compute unified device architecture; compute-intensive bilateral filtering kernel; data access patterns; hardware architectures; image processing applications; low-level single instruction multiple data parallelism; many core accelerations; many core architectures; manycore architecture; multicore accelerations; multicore architectures; multiple threads; pair-symmetric algorithm; performance gains; redundant calculations; single-instruction-multiple-thread efficiency; special purpose instructions; special registers; synchronization capability; ubiquitous tool; Filtering algorithms; Graphics processing unit; Instruction sets; Kernel; Multicore processing; Synchronization; Bilateral filtering; Image processing on GPUs; Image processing on multicores; Stencil codes using CUDA; Streaming SIMD Extensions;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel Processing (ICPP), 2012 41st International Conference on

Conference_Location :

Pittsburgh, PA

ISSN :

0190-3918

Print_ISBN :

978-1-4673-2508-0

Type :

conf

DOI :

10.1109/ICPP.2012.13

Filename :

6337569

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1920257