Alleviating Load Imbalance in Data Processing for Large-Scale Deep Learning

Sarunya Pumma; Daniele Buono; Fabio Checconi; Xinyu Que; Wu Chun Feng

doi:10.1109/CCGrid49817.2020.00-67

CCGRID 2020

Conference paper

01 May 2020

Abstract

Scalable deep learning remains an onerous challenge, as it is constrained by many factors, including those related to load imbalance. For many deep-learning software systems, multiple data-processing components (including neural network training, graph scheduling, input pipeline, and gradient synchronization) execute simultaneously and asynchronously. Such execution can cause the various data-processing components to contend with one another for hardware resources, leading to severe load imbalance and, in turn, degraded scalability. In this paper, we present an in-depth analysis of state-of-the-art deep-learning software, TensorFlow and Horovod, to understand their scalability limitations. Based on this analysis, we propose four novel solutions that minimize resource contention and improve deep-learning performance by up to 35% for training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory.