DocumentCode :
1920257
Title :
Acceleration of Bilateral Filtering Algorithm for Manycore and Multicore Architectures
Author :
Agarwal, Dinesh ; Wilf, Sami ; Dhungel, Abinashi ; Prasad, Sushil K.
Author_Institution :
Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA
fYear :
2012
fDate :
10-13 Sept. 2012
Firstpage :
78
Lastpage :
87
Abstract :
Bilateral filtering is a ubiquitous tool for several kinds of image processing applications. This work explores multicore and manycore accelerations for the embarrassingly parallel yet compute-intensive bilateral filtering kernel. For manycore architectures, we have created a novel pair-symmetric algorithm to avoid redundant calculations. For multicore architectures, we improve the algorithm by using low-level single instruction multiple data (SIMD) parallelism across multiple threads. We propose architecture-specific optimizations, such as exploiting the unique capabilities of special registers available in modern multicore architectures and rearranging data access patterns to match the computations so that special-purpose instructions can be exploited. We also propose optimizations pertinent to Nvidia's Compute Unified Device Architecture (CUDA), including utilization of CUDA's implicit synchronization capability and the maximization of single-instruction-multiple-thread efficiency. We present empirical data on the performance gains we achieved across a variety of hardware architectures, including Nvidia GTX 280, AMD Barcelona, AMD Shanghai, Intel Harpertown, AMD Phenom, quad-core Intel Core i7, and 32-core Intel Nehalem machines. The best results achieved were (i) a 169-fold speedup by the CUDA-based implementation of our pair-symmetric algorithm running on Nvidia's GTX 280 GPU compared to the compiler-optimized sequential code on Intel Core i7, and (ii) a 38-fold speedup using 16 cores of AMD Barcelona, each equipped with a 4-stage vector pipeline, compared to the compiler-optimized sequential code running on the same machine.
Keywords :
filtering theory; image processing; multi-threading; multiprocessing systems; parallel architectures; pipeline processing; program compilers; synchronisation; ubiquitous computing; AMD Barcelona; AMD Phenom; AMD Shanghai; Intel Core i7 quad core; Intel Harper town; Nvidia GTX 280; SIMD parallelism; architecture specific optimizations; bilateral filtering algorithm acceleration; compute unified device architecture; compute-intensive bilateral filtering kernel; data access patterns; hardware architectures; image processing applications; low-level single instruction multiple data parallelism; many core accelerations; many core architectures; manycore architecture; multicore accelerations; multicore architectures; multiple threads; pair-symmetric algorithm; performance gains; redundant calculations; single-instruction-multiple-thread efficiency; special purpose instructions; special registers; synchronization capability; ubiquitous tool; Filtering algorithms; Graphics processing unit; Instruction sets; Kernel; Multicore processing; Synchronization; Bilateral filtering; Image processing on GPUs; Image processing on multicores; Stencil codes using CUDA; Streaming SIMD Extensions;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing (ICPP), 2012 41st International Conference on
Conference_Location :
Pittsburgh, PA
ISSN :
0190-3918
Print_ISBN :
978-1-4673-2508-0
Type :
conf
DOI :
10.1109/ICPP.2012.13
Filename :
6337569