Author :
Ovtcharov, Kalin ; Tili, Ilian ; Steffan, J. Gregory
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Toronto, Toronto, ON, Canada
Abstract :
Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs-targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly-detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy-even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per-area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly-parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly-utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply-pipelined, IEEE-754 floating-point FUs of widely-varying latency, executing both H- dgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally-dense designs.The most computationally dense design is not necessarily the one with highest throughput and (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation eg., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.
Keywords :
field programmable gate arrays; high level synthesis; multi-threading; program compilers; scheduling; Black-Scholes options pricing model; C language; FPGA-based compute engine; FU usage; Hodgkin-Huxley neuron simulation; IEEE-754 floating-point FU; ILP; LLVM-based scheduler; OpenCL language; Stratix IV FPGA; architectural tradeoff; bank multiporting; compiler-generated circuit; compute density; dataflow graph; field programmable gate array; functional unit; high-level synthesis system; instruction-level parallelism; integer adder; memory bank; multithreaded FPGA-based compute engine; multithreaded VLIW soft processor family; per-function synthesized circuit; scheduling; soft processing engine; static compiler analysis; thread context; thread-level parallelism; very long instruction word; Area measurement; Computer architecture; Context; Engines; Field programmable gate arrays; Instruction sets; Throughput;