Self-tuned annealing in deep learning: How neural networks find generalizable solutions
Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the SGD-based learning dynamics and the loss landscape near solutions found by SGD, we discover a counter-intuitive relation between the weight fluctuations and the loss landscape: the flatter the landscape, the smaller the weight variance. To explain this inverse variance-flatness relation, we develop a random landscape theory of SGD, which shows that the noise strength (effective temperature) in SGD depends inversely on the landscape flatness. SGD thus serves effectively as a self-tuned (landscape-dependent) annealing mechanism for finding generalizable solutions at the flat minima of the loss landscape. Applications of these new insights to preventing catastrophic forgetting will also be discussed.
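The inverse variance-flatness relation can be probed numerically. Below is a minimal, self-contained sketch (not the authors' code): it trains a small PyTorch MLP on synthetic data with minibatch SGD, performs PCA on the late-training weight trajectory, and compares the SGD fluctuation variance along each principal direction with a simple flatness proxy, taken here as the distance along that direction over which the training loss doubles. The model, data, doubling threshold, and step counts are illustrative assumptions, not the definitions used in the analysis.

```python
# Minimal sketch of a variance-flatness probe for SGD on a toy problem.
# Assumptions (illustrative, not from the source): small PyTorch MLP,
# synthetic data, "flatness" = half-width along a PCA direction over which
# the full-batch training loss doubles relative to the mean solution.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary classification data
X = torch.randn(512, 20)
y = (X[:, 0] * X[:, 1] > 0).long()

model = nn.Sequential(nn.Linear(20, 32), nn.Tanh(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def flat_params():
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def set_params(vec):
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(vec[i:i + n].reshape(p.shape))
            i += n

def full_loss():
    with torch.no_grad():
        return loss_fn(model(X), y).item()

# 1) Train with minibatch SGD; keep weight snapshots after an initial transient.
snapshots = []
for step in range(3000):
    idx = torch.randint(0, X.shape[0], (32,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()
    if step >= 2000:                     # late-training fluctuation regime
        snapshots.append(flat_params())

W = torch.stack(snapshots)               # (T, N) weight trajectory
w_mean = W.mean(dim=0)

# 2) PCA of the weight fluctuations: principal directions explored by SGD noise.
U, S, Vt = torch.linalg.svd(W - w_mean, full_matrices=False)
variances = (S ** 2) / (W.shape[0] - 1)  # fluctuation variance along each PC

# 3) Flatness proxy along a PC direction: distance from the mean solution
#    at which the training loss doubles relative to its value at the mean.
def flatness(direction, base_loss, max_shift=5.0, n_probe=100):
    for t in torch.linspace(0, max_shift, n_probe)[1:]:
        set_params(w_mean + t * direction)
        if full_loss() > 2.0 * base_loss:
            return t.item()
    return max_shift

set_params(w_mean)
base = full_loss()
for k in range(5):                       # a few leading PCA directions
    f = flatness(Vt[k], base)
    set_params(w_mean)                   # restore before the next probe
    print(f"PC{k}: variance={variances[k].item():.3e}  flatness~{f:.3f}")
# The inverse variance-flatness relation predicts that directions with larger
# flatness exhibit smaller weight variance.
```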