Title :
Enabling an OpenCL Compiler for Embedded Multicore DSP Systems
Author :
Li, Jia-Jhe ; Kuan, Chi-Bang ; Wu, Tung-Yu ; Lee, Jenq Kuen
Author_Institution :
Dept. of Comput. Sci., Nat. Tsing Hua Univ., Hsinchu, Taiwan
Abstract :
OpenCL is an industry attempt to unify heterogeneous multicore programming. With a programming model that defines SPMD kernels, vector types, and address space qualifiers, OpenCL allows programmers to exploit data parallelism through multicore processors and SIMD instructions, as well as data locality through the memory hierarchy. Recently, OpenCL has gained success on many architectures, including multicore CPUs, GPUs, vector processors, embedded systems with application-specific processors, and even FPGAs. However, how to support OpenCL for embedded multicore DSP systems remains unaddressed. In this paper, we present our OpenCL support for embedded multicore DSP systems. Our target platform consists of one MPU and a DSP subsystem with multiple DSPs. The DSPs we address are VLIW processors with clustered functional units and distributed register files. To generate efficient code for such DSPs, compilers must account for irregular register file access in many optimization phases. To utilize the DSPs with distributed register files, we propose a cluster-aware work-item dispatching scheme that vectorizes OpenCL kernels and assigns independent workloads to the clusters of a DSP. In addition, we incorporate several optimizations to enable efficient DSP code generation. In our experiments, we employ a set of OpenCL benchmark programs to evaluate the effectiveness of our OpenCL support. The experiments are conducted on a cycle-accurate DSP simulator and a multicore evaluation board. We report an average 29% performance improvement with our vectorization scheme and a nearly 2-fold speedup with two DSPs compared with a single-MPU setup.
Keywords :
digital signal processing chips; electronic engineering computing; embedded systems; field programmable gate arrays; graphics processing units; multiprocessing systems; operating system kernels; optimising compilers; parallel processing; program compilers; software performance evaluation; DSP code generation; DSP cycle-accurate simulator; DSP subsystem; FPGA; GPU; MPU subsystem; OpenCL benchmark programs; OpenCL compiler; OpenCL kernels; OpenCL support; SIMD instructions; SPMD kernels; VLIW processors; address space qualifiers; application-specific processors; cluster-aware work-item dispatching scheme; clustered functional units; compilers; data locality; data parallelism; distributed register files; embedded multicore DSP systems; embedded systems; heterogeneous multicore programming; independent workload; irregular register file access; memory hierarchy; multicore CPU; multicore evaluation board; multicore processors; optimization phases; performance improvement; programming model; single-MPU setup; vector processors; vector types; vectorization scheme; Digital signal processing; Kernel; Multicore processing; Program processors; Registers; VLIW; Vectors;
Conference_Title :
Parallel Processing Workshops (ICPPW), 2012 41st International Conference on
Conference_Location :
Pittsburgh, PA
Print_ISBN :
978-1-4673-2509-7
DOI :
10.1109/ICPPW.2012.74