Memory and Interconnect Optimizations for Peta-Scale Deep Learning Systems
Hardware accelerators are a promising solution to the stringent computational requirements of Deep Neural Networks (DNNs). Ranging from low-power IP cores to server-class systems, various accelerator architectures have been proposed that offer high TOPS/W peak processing efficiency and the flexibility to execute different DNN topologies. Prior efforts improve core utilization through better dataflows and computation sequencing, but little effort has thus far been devoted to systematically programming DNN accelerators to extract the best possible system utilization, particularly for DNN training, which can be parallelized across peta-scale systems. In this work, we address the hitherto open challenge of systematically mapping computations onto peta-scale accelerator systems, comprising thousands of processing cores spanning many chips, while maximizing overall system performance. We achieve this by characterizing the design space of possible mapping configurations, building a detailed performance model that incorporates every computation and data transfer involved in DNN training, and using a design space exploration tool called DEEPSPATIALMATRIX to identify the performance-optimal configuration. We highlight four key optimizations built within DEEPSPATIALMATRIX: hybrid data-model parallelism, inter-layer memory reuse, time-step pipelining, and dynamic spatial minibatching. Each of these improves system utilization by carefully managing the available memory capacity and interconnect bandwidth to balance compute vs. communication costs. On an 8-peta-FLOP accelerator system, we demonstrate a 1.36×–32× improvement in training performance through our design space exploration and optimizations across image recognition (VGG16, ResNet50) and machine translation (GNMT) DNN models.
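To make the design-space-exploration idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual DEEPSPATIALMATRIX tool) of an analytical performance model. It enumerates hybrid data/model-parallel mapping configurations for a single layer on a multi-chip system and picks the split that minimizes estimated step time. All classes, parameter values, and the linear communication-cost formulas are illustrative assumptions.

```python
# Hypothetical sketch: toy analytical model for choosing a hybrid
# data/model-parallel mapping on a multi-chip accelerator system.
# Cost formulas are simplified assumptions, not the paper's model.
from dataclasses import dataclass


@dataclass
class System:
    chips: int             # number of accelerator chips
    flops_per_chip: float  # peak FLOP/s per chip
    link_bw: float         # inter-chip bandwidth, bytes/s


@dataclass
class Layer:
    flops: float         # FLOPs per sample for this layer
    weight_bytes: float  # size of the layer's weights
    act_bytes: float     # activation size per sample


def step_time(sys_, layer, batch, data_par, model_par):
    """Estimate one training-step time for a (data, model) parallel split.

    Compute is divided across all chips; data parallelism adds a weight-
    gradient all-reduce, model parallelism adds activation exchange.
    """
    if data_par * model_par != sys_.chips:
        return float("inf")  # not a valid factorization of the chip count
    compute = layer.flops * batch / (sys_.chips * sys_.flops_per_chip)
    # Ring all-reduce of weight gradients across data-parallel replicas.
    grad_comm = (2 * layer.weight_bytes * (data_par - 1)
                 / (data_par * sys_.link_bw)) if data_par > 1 else 0.0
    # Activation exchange between model-parallel shards.
    act_comm = (layer.act_bytes * batch * (model_par - 1)
                / (model_par * sys_.link_bw)) if model_par > 1 else 0.0
    return compute + grad_comm + act_comm


def best_mapping(sys_, layer, batch):
    """Exhaustively search (data_par, model_par) factorizations of the chips."""
    configs = [(d, sys_.chips // d)
               for d in range(1, sys_.chips + 1) if sys_.chips % d == 0]
    return min(configs, key=lambda c: step_time(sys_, layer, batch, *c))


if __name__ == "__main__":
    sys_ = System(chips=16, flops_per_chip=1e14, link_bw=1e11)
    # FC-like layer: large weights favor model parallelism.
    fc = Layer(flops=2e9, weight_bytes=4e8, act_bytes=4e4)
    # Conv-like layer: small weights favor data parallelism.
    conv = Layer(flops=2e9, weight_bytes=4e4, act_bytes=4e6)
    print(best_mapping(sys_, fc, 256))    # (data_par, model_par) for FC
    print(best_mapping(sys_, conv, 256))  # (data_par, model_par) for conv
```

Even this toy model reproduces the qualitative trade-off the abstract describes: weight-heavy layers push the search toward model parallelism (cheap gradient sync), while activation-heavy layers push it toward data parallelism, so the optimum is layer- and system-dependent rather than fixed.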