Self-tuned annealing in deep learning: How neural networks find generalizable solutions
Abstract
Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the SGD-based learning dynamics and the loss-function landscape near solutions found by SGD, we discover a counter-intuitive relation between weight fluctuations and the loss landscape: the flatter the landscape, the smaller the weight variance. To explain this inverse variance-flatness relation, we develop a random landscape theory of SGD, which shows that the noise strength (effective temperature) in SGD depends inversely on the landscape flatness. SGD thus effectively serves as a self-tuned, landscape-dependent annealing mechanism that finds generalizable solutions at the flat minima of the loss landscape. Applications of these insights to preventing catastrophic forgetting will also be discussed.
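The following is a minimal NumPy sketch (not the authors' code) of how the inverse variance-flatness relation described above could be probed empirically: run mini-batch SGD until the loss plateaus, apply PCA to the weight fluctuations around the solution, and compare the variance along each PCA direction with a flatness scale of the loss profile along that same direction. The model, the snapshot schedule, the `flatness` helper, and its 10% loss-increase threshold are all illustrative assumptions, not quantities taken from the paper.

```python
# Sketch: empirically probing the inverse variance-flatness relation.
# All function names, hyperparameters, and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data and a logistic-regression model
# standing in for a "network" with weight vector w.
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def loss(w, Xb, yb):
    """Mean cross-entropy loss of the logistic model on a (mini-)batch."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    eps = 1e-12
    return -np.mean(yb * np.log(p + eps) + (1 - yb) * np.log(1 - p + eps))

def grad(w, Xb, yb):
    """Mini-batch gradient of the cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

# Mini-batch SGD; keep weight snapshots after an initial transient.
w = np.zeros(d)
lr, batch, steps, burn_in = 0.5, 32, 6000, 3000
snapshots = []
for t in range(steps):
    idx = rng.choice(n, size=batch, replace=False)
    w -= lr * grad(w, X[idx], y[idx])
    if t >= burn_in:
        snapshots.append(w.copy())
W = np.array(snapshots)

# PCA of the weight fluctuations around the SGD solution: eigenvalues of
# the covariance matrix give the weight variance along each PCA direction.
w_bar = W.mean(axis=0)
cov = np.cov((W - w_bar).T)
var, vecs = np.linalg.eigh(cov)
order = np.argsort(var)[::-1]
var, vecs = var[order], vecs[:, order]

def flatness(direction, rel_increase=0.1, max_r=5.0, n_r=200):
    """Width of the loss profile along `direction`: total distance from
    w_bar (both ways) before the full-batch loss rises by `rel_increase`."""
    L0 = loss(w_bar, X, y)
    widths = []
    for sign in (+1, -1):
        r_hit = max_r
        for r in np.linspace(0.0, max_r, n_r):
            if loss(w_bar + sign * r * direction, X, y) > (1 + rel_increase) * L0:
                r_hit = r
                break
        widths.append(r_hit)
    return sum(widths)

for i in range(5):
    print(f"PCA dir {i}: variance={var[i]:.3e}  flatness={flatness(vecs[:, i]):.3f}")
# An inverse trend (larger flatness together with smaller variance) is the
# signature stated in the abstract; the precise scaling is model-dependent.
```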