Performance-driven Programming of Multi-TFLOP Deep Learning Accelerators∗
Deep Neural Network (DNN) accelerator architectures have evolved rapidly in recent years, demonstrating impressive peak processing efficiencies. However, little effort has been devoted to developing systematic methodologies for programming DNN accelerators so as to extract the best utilization across a range of DNN workloads. This is critical because DNN layers vary dramatically in their computational characteristics, and each must be programmed differently to maximize overall performance. In this work, we address this challenge in the context of the previously proposed RaPiD multi-TFLOP DNN accelerator, which comprises a 2D systolic array of processing elements, a 1D array of special function units, and a scratchpad memory. We develop DeepMatrix, a framework that enables systematic exploration of the design space of mappings from a DNN to a given accelerator architecture and can discover even non-intuitive optimization strategies that achieve high utilization. Specifically, given a DNN, DeepMatrix identifies how the computations should be spatiotemporally sequenced, how much data should be staged at each level of the memory hierarchy, and when data transfers between levels of the hierarchy should occur, so that performance is maximized while meeting the constraints imposed by the hardware (processing power, memory capacity, bandwidth, etc.). DeepMatrix achieves this by building a parameterized design space of mapping configurations and applying a design-space exploration methodology to identify the best configuration. Across multiple large, practical DNNs (AlexNet, ResNet, VGG), we demonstrate that DeepMatrix yields a 1.4x-2.8x performance improvement over hand-tuned homogeneous mappings.
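To make the notion of a parameterized mapping space and its exploration concrete, the following is a minimal, illustrative sketch (not DeepMatrix's actual cost model or RaPiD's actual specifications; all parameter values, names, and the roofline-style cost estimate are assumptions for illustration). It enumerates candidate loop tilings for a single convolutional layer, rejects tilings whose working set exceeds a hypothetical scratchpad capacity, scores feasible tilings with a simple compute-versus-bandwidth bound, and keeps the fastest one:

```python
from itertools import product

# Hypothetical accelerator parameters (illustrative, not RaPiD's actual specs).
PEAK_FLOPS = 8e12        # 8 TFLOP/s peak for the systolic array
SCRATCHPAD_BYTES = 2e6   # 2 MB on-chip scratchpad
DRAM_BW = 100e9          # 100 GB/s external memory bandwidth
BYTES = 2                # fp16 operands

def conv_cost(N, C, K, H, W, R, S, tiles):
    """Estimate runtime of one NxCxKxHxW conv layer (RxS kernels) under a
    (tc, tk, th) channel/filter/row tiling. Returns None if the tile's
    working set does not fit in the scratchpad."""
    tc, tk, th = tiles
    # Working set staged on-chip for one tile: input, weight, output slices.
    ws = BYTES * (tc * (th + R - 1) * (W + S - 1)   # input tile (with halo)
                  + tc * tk * R * S                 # weight tile
                  + tk * th * W)                    # output tile
    if ws > SCRATCHPAD_BYTES:
        return None
    flops = 2.0 * N * C * K * H * W * R * S
    # Each output tile re-streams its input/weight tiles from external memory.
    n_tiles = (C // tc) * (K // tk) * (H // th)
    dram_bytes = N * n_tiles * ws
    # Simple roofline: runtime is bounded by compute or by data transfer.
    return max(flops / PEAK_FLOPS, dram_bytes / DRAM_BW)

def best_mapping(N, C, K, H, W, R, S, candidates):
    """Exhaustively score candidate tilings; keep the fastest feasible one."""
    best = None
    for tiles in candidates:
        t = conv_cost(N, C, K, H, W, R, S, tiles)
        if t is not None and (best is None or t < best[1]):
            best = (tiles, t)
    return best

# Example: one ResNet-like 3x3 layer, searched over a small grid of tilings.
cands = list(product([16, 32, 64], [16, 32, 64], [7, 14]))
print(best_mapping(1, 64, 64, 56, 56, 3, 3, cands))
```

A real framework in this space replaces the exhaustive grid with a structured search and replaces the two-term roofline with a model that accounts for array utilization, double buffering, and per-level bandwidths, but the shape of the problem, a constrained search over a parameterized configuration space scored by an analytical cost model, is the same.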