Acceleration of Large Deep Learning Training with Hybrid GPU Memory Management of Swapping and Re-computing

Abstract

Deep learning has achieved overwhelmingly better accuracy than existing methods in various fields. To further improve deep learning, deeper and larger neural network (NN) models are indispensable, but graphics processing unit (GPU) memory is not large enough to train such models. One promising way to reduce GPU memory consumption is data swapping, which eases the burden on GPU memory by swapping intermediate data out to central processing unit (CPU) memory while the data are not needed. However, this method introduces communication overhead for transferring data between GPU and CPU memory. Another method is re-computation, which discards intermediate data once and computes them again when they become necessary. Unlike data swapping, re-computation introduces additional computation but requires no CPU-GPU communication, so applying it selectively can reduce the communication incurred by data swapping. In this paper, we developed a faster training method for large NN models that combines re-computation with data swapping. The method automatically edits the TensorFlow computation graph, guided by a heuristic that decides which parts of the graph should be re-computed. The heuristic divides the graph into sub-graphs and applies re-computation to them in decreasing order of the amount of data they would swap, aiming to remove as much communication as possible while re-computing as few sub-graphs as possible. Our hybrid method improves performance over the existing data-swapping method by up to 15.5% for ResNet50 with an image size of 8000×8000, 15.3% for ResNet152 with an image size of 7500×7500, and 12.4% for DeepLabV3+ with an image size of 1000×1000.
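
The selection heuristic described above can be pictured as a simple greedy procedure: sub-graphs are ranked by how much data they would otherwise swap to CPU memory, and the largest ones are marked for re-computation instead, up to some allowed extra-computation budget. The sketch below only illustrates that idea in Python; the names (SubGraph, choose_recompute_targets, swap_bytes, recompute_cost, budget) are hypothetical and do not correspond to the paper's actual implementation or to any TensorFlow API.

    # Illustrative sketch of a greedy re-computation selection heuristic.
    # All names and the budget-based stopping rule are assumptions, not the
    # paper's actual algorithm or API.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SubGraph:
        name: str
        swap_bytes: int        # data this sub-graph would swap to CPU memory
        recompute_cost: float  # extra GPU time if its outputs are re-computed

    def choose_recompute_targets(subgraphs: List[SubGraph],
                                 budget: float) -> List[SubGraph]:
        """Greedily pick sub-graphs to re-compute, largest swap volume first,
        until the allowed extra-computation budget is exhausted."""
        chosen = []
        spent = 0.0
        for sg in sorted(subgraphs, key=lambda s: s.swap_bytes, reverse=True):
            if spent + sg.recompute_cost <= budget:
                chosen.append(sg)
                spent += sg.recompute_cost
        return chosen

    if __name__ == "__main__":
        graph = [SubGraph("conv_block_1", 4_000_000_000, 0.8),
                 SubGraph("conv_block_2", 2_500_000_000, 0.5),
                 SubGraph("head", 300_000_000, 0.1)]
        for sg in choose_recompute_targets(graph, budget=1.0):
            print(f"re-compute {sg.name}, avoiding {sg.swap_bytes} bytes of swapping")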

Date

10 Dec 2020

Publication

Big Data 2020
