Conference paper

Variable batch size across layers for efficient prediction on CNNs

Abstract

CNNs are used extensively for computer vision tasks such as activity recognition, image classification, and segmentation. The large amount of memory these applications require restricts the batch size usable during inference, which increases the overall prediction time. Prior work addresses this issue through model compression mechanisms such as weight/filter pruning and quantization of parameters or intermediate outputs. We propose a complementary technique that improves inference time by using variable batch sizes (VBS) across the layers of a CNN. This optimizes the memory-time trade-off for each layer and leads to better network throughput. Our approach makes no modifications to the existing network (unlike pruning or quantization techniques), so model accuracy is unaffected. We develop a dynamic programming (DP) based algorithm that takes the inference time and memory required by each layer of the network as input and computes the optimal batch size for every layer given the available resources (RAM, storage space, etc.). We demonstrate our findings in two settings: video inference on K80 GPUs and image inference on edge devices. On video networks such as C3D, our VBS algorithm achieves up to 61% higher throughput than a fixed-batch-size baseline; on image networks such as GoogLeNet and ResNet50, it achieves up to 60% higher throughput.
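To make the DP concrete, the sketch below casts per-layer batch-size selection as a memory-constrained, knapsack-style search. This is a minimal illustration, not the paper's implementation: the profile tables times[l][b] and mems[l][b], the integer memory units, the single additive memory budget, and the toy numbers are all assumptions made here, with each layer scored by its batch time amortized per image.

    # Minimal sketch (not the paper's code) of a layer-wise batch-size DP.
    # Illustrative assumptions: times[l][b] and mems[l][b] are profiled
    # running time (seconds) and memory footprint (integer units) of
    # layer l at candidate batch size b, and the per-layer footprints
    # are treated as additive against one shared memory budget.
    from functools import lru_cache

    def optimal_batch_sizes(times, mems, budget):
        """Return (time per image, tuple of batch sizes, one per layer)."""
        n = len(times)

        @lru_cache(maxsize=None)
        def best(layer, remaining):
            if layer == n:                       # every layer assigned
                return (0.0, ())
            options = []
            for b, t in times[layer].items():
                if mems[layer][b] <= remaining:  # fits leftover budget
                    tail = best(layer + 1, remaining - mems[layer][b])
                    # t / b: layer time amortized over the batch
                    options.append((t / b + tail[0], (b,) + tail[1]))
            return min(options) if options else (float("inf"), ())

        return best(0, budget)

    # Toy two-layer profile: bigger batches amortize time, cost memory.
    times = [{1: 0.010, 8: 0.020, 32: 0.050},
             {1: 0.008, 8: 0.030, 32: 0.090}]
    mems  = [{1: 1, 8: 4, 32: 12},
             {1: 1, 8: 3, 32: 10}]
    print(optimal_batch_sizes(times, mems, budget=10))  # ~ (0.00625, (8, 8))

Memoizing on (layer, remaining budget) keeps the search polynomial in the number of layers times the number of discrete memory units, which is what makes a DP formulation attractive compared with enumerating every combination of per-layer batch sizes.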

Date

18 Oct 2020

Publication

CLOUD 2020

