Title :
Extracting speedup from C-code with poor instruction-level parallelism
Author :
Kusic, Dara ; Hoare, Raymond ; Jones, Alex K. ; Fazekas, Joshua ; Foster, John
Author_Institution :
Electr. & Comput. Eng., Pittsburgh Univ., PA, USA
Abstract :
Scientific computing and multimedia applications frequently call loop-intensive functions that dominate execution time. Applying homogeneous parallel processors, e.g., single-instruction multiple-data (SIMD) and very long instruction word (VLIW) architectures, is a common approach to minimizing execution time. However, many benchmark applications offer disappointing degrees of instruction-level parallelism (ILP), causing these architectures to fall short of expected performance gains. This paper presents findings on execution-time speedup achieved by heterogeneous massively parallel processors: standard reduced instruction-set computing (RISC) CPUs tightly coupled with arrays of super-complex instruction-set computing (SuperCISC) datapaths on the same chip. SuperCISC datapaths are created by mapping frequently called functions into reconfigurable hardware. Encouraging performance results from the RISC/SuperCISC architecture point to the efficiency of reconfigurable devices in supporting large numbers of parallel computational accelerators. Calls to SuperCISC functions can greatly reduce execution time on CPUs that support extensible instruction sets. In this paper we show how SuperCISC functions can accelerate an application by up to 25x over a 4-way VLIW. SuperCISC functions exhibit superlinear speedup, a performance gain significantly greater than the software's available ILP. SuperCISC functions also benefit from cycle compression, a reduction of the idle cycle time required for an operation to execute within a traditional CPU. Implementing software controls, i.e., if-then-else statements, as hardware multiplexers within a SuperCISC function further improves performance.
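The abstract's last point, mapping if-then-else controls to hardware multiplexers, can be illustrated with a minimal C sketch. This is not code from the paper; the function names and the clamp example are hypothetical, chosen only to show the control-flow form a compiler would branch on versus the data-flow select form that corresponds directly to a 2-to-1 multiplexer in a synthesized datapath.

/* Hypothetical illustration (not from the paper): the same computation
 * written as a branch and as a select. The select form evaluates both
 * candidate values and lets the condition choose one, which maps
 * naturally to a hardware multiplexer in a SuperCISC-style datapath. */
#include <stdio.h>

/* Control-flow version: the branch serializes execution on a CPU. */
static int clamp_branch(int x, int limit)
{
    if (x > limit)
        return limit;
    else
        return x;
}

/* Data-flow version: condition acts as the mux select line. */
static int clamp_select(int x, int limit)
{
    int cond = (x > limit);
    return cond ? limit : x;  /* in hardware: a 2-to-1 mux selecting limit or x */
}

int main(void)
{
    printf("%d %d\n", clamp_branch(7, 5), clamp_select(7, 5)); /* prints: 5 5 */
    printf("%d %d\n", clamp_branch(3, 5), clamp_select(3, 5)); /* prints: 3 3 */
    return 0;
}

Both functions return identical results; the difference is that the select form removes the control dependence, so in a hardware implementation both operands flow to a multiplexer in the same cycle rather than waiting on a resolved branch.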
Keywords :
instruction sets; parallel architectures; parallel machines; reconfigurable architectures; reduced instruction set computing; call loop-intensive function; cycle compression; heterogeneous massively parallel processor; idle cycle time; instruction-level parallelism; multimedia application; parallel processor; reconfigurable hardware; standard reduced instruction-set computing; super-complex instruction-set computing; Computer aided instruction; Computer architecture; Concurrent computing; Hardware; Parallel processing; Performance gain; Reduced instruction set computing; Scientific computing; Software performance; VLIW;
Conference_Title :
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
Print_ISBN :
0-7695-2312-9
DOI :
10.1109/IPDPS.2005.216