Reducing global reductions in large-scale distributed training
Abstract
Current large-scale training of deep neural networks typically employs synchronous stochastic gradient descent, which incurs a large communication overhead. Instead of optimizing reduction routines, as recent studies have done, we propose algorithms that do not require frequent global reductions. We first show that reducing the global reduction frequency acts as an effective regularizer that improves the generalization of adaptive optimizers. We then propose an algorithm that lowers the global reduction frequency by employing local reductions over a subset of learners. In addition, to maximize the effect of each reduction on convergence, we introduce reduction momentum, which further accelerates convergence. Our experiments on the CIFAR-10 dataset show that, for the K-step averaging algorithm, extremely sparse reductions help bridge the generalization gap. With 6 GPUs, our implementation eliminates more than 99% of the global reductions performed by a regular synchronous implementation; with 32 GPUs, it halves the number of global reductions. On the ImageNet-1K dataset, we show that combining local reductions with global reductions and applying reduction momentum can reduce global reductions by up to a further 62% while matching the validation accuracy of K-step averaging. With 400 GPUs, the global reduction frequency is reduced to once per 102K samples.
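To make the pattern referred to above concrete, the sketch below illustrates K-step averaging with an interleaved local reduction: each learner takes purely local optimizer steps and only periodically averages parameters, either within a small subgroup of learners (a local reduction) or across all learners (a global reduction). This is a minimal illustrative sketch, not the paper's implementation; the function names, group construction, schedule constants, and the omission of reduction momentum are all assumptions made for exposition.

import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, batch, step,
               k_local=16, k_global=256, local_group=None):
    """One training step with periodic parameter averaging (illustrative only).

    Hypothetical schedule: every `k_local` steps, parameters are averaged
    within a subgroup of learners (local reduction); every `k_global` steps,
    they are averaged across all learners (global reduction).
    """
    inputs, targets = batch
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    optimizer.step()  # purely local update, no communication

    if step % k_global == 0:
        # Global reduction: average parameters over the default (world) group.
        _average_parameters(model, group=None, size=dist.get_world_size())
    elif local_group is not None and step % k_local == 0:
        # Cheaper local reduction over a subset of learners.
        _average_parameters(model, group=local_group,
                            size=dist.get_world_size(local_group))

def _average_parameters(model, group, size):
    # All-reduce each parameter tensor, then divide by the group size to average.
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
        p.data.div_(size)

A local subgroup could, for example, be created with dist.new_group(ranks) over the ranks that share a node, so that local reductions stay on fast intra-node links while global reductions remain infrequent.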