Accelerating arithmetic kernels with coherent attached FPGA coprocessors
The energy efficiency of computer systems can be increased by migrating computational kernels that are known to under-utilize the CPU to an FPGA based coprocessor. In contrast to traditional I/O-based coprocessors that require explicit data movement, coherently attached accelerators can operate on the same virtual address space than the host CPU. A shared memory organization enables widely accepted programming models and helps to deploy energy efficient accelerators in general purpose computing systems. In this paper we study an FFT accelerator on FPGA attached via the Coherent Accelerator Processor Interface (CAPI) to a POWER8 processor. Our results show that the coherent attached accelerator outperforms device driver based approaches in terms of latency. Hardware acceleration delivers a 5× gain in energy efficiency compared to an optimized parallel software FFT running on a 12-core CPU and improves single thread performance by more than 2×. We conclude that the integration of CAPI into heterogeneous programming frameworks such as OpenCL will facilitate latency critical operations and will further enhance programmability of hybrid systems.