Physical Review Letters

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Generalization is one of the most important problems in deep learning, where overparametrization gives rise to many low-loss solutions. Previous empirical studies showed a strong correlation between the flatness of the loss landscape at a solution and its generalizability, and that stochastic gradient descent (SGD) is crucial for finding flat solutions. To understand the effects of SGD, we construct a simple model whose overall loss landscape has a continuous set of degenerate (or near-degenerate) minima and in which the loss landscape for a minibatch is approximated by a random shift of the overall loss function. By direct simulations of the stochastic learning dynamics and by solving the underlying Fokker-Planck equation, we show that, due to its strong anisotropy, the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and the batch-to-batch variation. We find that this additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. As a result, the flatness of the overall loss landscape increases during learning and reaches a higher value (a flatter minimum) for larger SGD noise strength, up to a critical noise strength beyond which the system fails to converge. These results, which are verified in realistic neural network models, elucidate the role of SGD in generalization, and they may also have important implications for selecting hyperparameters that allow efficient learning without divergence.
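The qualitative picture described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the model or equations of the paper. The two-parameter loss L(x, y) = 0.5 (1 + x^2) y^2, the curvature profile `curvature`, the Gaussian shift of width `sigma`, and the function name `simulate_sgd` are all hypothetical choices made for illustration. The loss has a degenerate valley of minima along y = 0 whose curvature is lowest (flattest) at x = 0, and each minibatch gradient is taken from a randomly shifted copy of that loss. The shifts are not constrained so that the minibatch losses average exactly to the overall loss, so the sketch shows only the qualitative drift toward flatter regions.

```python
import numpy as np


def curvature(x):
    # Hypothetical curvature of the valley along y; flattest at x = 0.
    return 1.0 + x**2


def simulate_sgd(lr=0.05, sigma=0.3, steps=5000, x0=2.0, y0=0.0, seed=0):
    """Run SGD on L(x, y) = 0.5 * curvature(x) * y**2 with randomly shifted
    minibatch losses, and return the curvature at the start of every step."""
    rng = np.random.default_rng(seed)
    x, y = x0, y0
    curv = np.empty(steps)
    for t in range(steps):
        curv[t] = curvature(x)
        # Minibatch loss = overall loss evaluated at a randomly shifted point.
        dx, dy = rng.normal(0.0, sigma, size=2)
        xs, ys = x - dx, y - dy
        grad_x = xs * ys**2            # dL/dx of 0.5 * (1 + x^2) * y^2
        grad_y = curvature(xs) * ys    # dL/dy of 0.5 * (1 + x^2) * y^2
        x -= lr * grad_x
        y -= lr * grad_y
    return curv


if __name__ == "__main__":
    noisy = simulate_sgd()
    noiseless = simulate_sgd(sigma=0.0)
    print("initial curvature:", noisy[0])
    print("late-time mean curvature with noise:", noisy[-500:].mean())
    print("final curvature without noise:", noiseless[-1])
```

In this toy, a run started on the valley floor does not move without batch-to-batch variation (`sigma=0.0`), whereas with noise the parameters drift along the valley toward lower curvature. Varying `lr` and `sigma` is one way to explore how the strength of that drift depends on the learning rate and the size of the shift.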