Dynamics of Deep Learning: Landscape-dependent Noise, Inverse Einstein Relation, and Flat Minima
Despite tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. In this talk, we discuss our recent work on establishing a theoretical framework based on nonequilibrium statistical physics to understand the SGD learning dynamics, the loss function landscape, and their relationship. Our study shows that SGD dynamics follows a low-dimensional drift-diffusion motion in the weight space and the loss function is flat with large values of flatness (inverse of curvature) in most directions. Furthermore, our study reveals a robust inverse relation between the weight variance in SGD and the landscape flatness opposite to the fluctuation-response relation in equilibrium systems. We develop a statistical theory of SGD based on properties of the ensemble of minibatch loss functions and show that the noise strength in SGD depends inversely on the landscape flatness, which explains the inverse variance-flatness relation. Our study suggests that SGD serves as an “smart” annealing strategy where the effective temperature self-adjusts according to the loss landscape in order to find the flat minimum regions that contain generalizable solutions. Finally, we discuss an application of these insights for reducing catastrophic forgetting efficiently for sequential multiple tasks learning.