Publication
APS March Meeting 2020
Talk
Self-tuned annealing in deep learning: How neural networks find generalizable solutions
Abstract
Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the SGD-based learning dynamics and the loss-function landscape near solutions found by SGD, we discover a counter-intuitive relation between the weight fluctuations and the loss landscape: the flatter the landscape, the smaller the weight variance. To explain this inverse variance-flatness relation, we develop a random-landscape theory of SGD, which shows that the noise strength (effective temperature) in SGD depends inversely on the landscape flatness; SGD thus serves effectively as a self-tuned (landscape-dependent) annealing mechanism that finds generalizable solutions at the flat minima of the loss landscape. Applications of these insights to preventing catastrophic forgetting will also be discussed.
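The measurement behind the inverse variance-flatness relation can be illustrated with a small numerical experiment: record weight snapshots from SGD near convergence, find the principal directions of the weight fluctuations, and compare the variance along each direction with a flatness proxy for the loss landscape in that direction. The sketch below is a minimal, self-contained illustration under several assumptions (a toy two-layer network on random data, finite-difference gradients, and a second-difference curvature estimate as the flatness proxy); it is not the authors' code, and the talk's exact flatness definition may differ.

```python
import numpy as np

# Toy data and a small two-layer network trained with plain SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)

def init_params():
    return [rng.normal(scale=0.1, size=(10, 16)), rng.normal(scale=0.1, size=(16,))]

def forward(params, X):
    W1, w2 = params
    h = np.tanh(X @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))

def loss(params, X, y):
    p = forward(params, X)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def flatten(params):
    return np.concatenate([p.ravel() for p in params])

def unflatten(vec, params):
    out, i = [], 0
    for p in params:
        out.append(vec[i:i + p.size].reshape(p.shape))
        i += p.size
    return out

def numeric_grad(params, X, y, eps=1e-4):
    # Central finite-difference gradient: slow but dependency-free for a toy model.
    theta = flatten(params)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (loss(unflatten(tp, params), X, y) - loss(unflatten(tm, params), X, y)) / (2 * eps)
    return g

# Mini-batch SGD; after a burn-in phase, record weight snapshots near the minimum.
params = init_params()
lr, batch, snapshots = 0.5, 20, []
for step in range(3000):
    idx = rng.choice(len(X), batch, replace=False)
    g = numeric_grad(params, X[idx], y[idx])
    params = unflatten(flatten(params) - lr * g, params)
    if step >= 2000:
        snapshots.append(flatten(params))

# PCA of the weight fluctuations around the mean solution found by SGD.
W = np.array(snapshots)
W_centered = W - W.mean(axis=0)
cov = W_centered.T @ W_centered / len(W)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigvals = weight variances along PCA directions

# Flatness proxy along each top PCA direction: curvature of the full-batch loss
# estimated by a symmetric second difference (an assumption, not the talk's definition).
theta0 = W.mean(axis=0)
delta = 0.1
for k in range(1, 6):                       # top-5 directions by weight variance
    v = eigvecs[:, -k]
    curv = (loss(unflatten(theta0 + delta * v, params), X, y)
            - 2 * loss(unflatten(theta0, params), X, y)
            + loss(unflatten(theta0 - delta * v, params), X, y)) / delta**2
    flatness = 1.0 / max(curv, 1e-12)       # flatter direction -> smaller curvature
    print(f"PCA direction {k}: weight variance {eigvals[-k]:.3e}, flatness {flatness:.3e}")
```

Under the relation described in the abstract, directions with larger flatness should show smaller weight variance; the printed variance-flatness pairs let one check that trend on this toy example.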