A System-Level Transprecision FPGA Accelerator for BLSTM Using On-chip Memory Reshaping
The large amount of processing and storage of modern neural networks challenges engineers to architect dedicated and tailored hardware with high energy efficiency. At the inflection point of choosing among the most appropriate acceleration platform, FPGAs offer a competitive advantage with their irregular parallelism and bit-level re-programmability, at the cost of development effort. One critical problem is the lack of a common development flow between CPU and FPGA that combines advantages of both software and hardware world, i.e. integrated programmability and adaptable acceleration. This work presents a system-level FPGA implementation framework for BLSTM-based neural networks acceleration that introduces a) flexible reduced-precision (transprecision) data-paths and b) on-chip memory reshaping for storing model parameters. By evaluating the proposed architecture to an OCR application, it was possible to decrease the energy-To-solution by 21.9x and 2.6x compared to that of a POWER8 processor and a P100 GPU, respectively.