Phases of learning dynamics in artificial neural networks in the absence or presence of mislabeled data
Despite the tremendous success of deep neural networks in machine learning, the underlying reason for their superior learning capability remains unclear. Here, we present a framework based on statistical physics to study the dynamics of stochastic gradient descent (SGD), which drives learning in neural networks. Using the minibatch gradient ensemble, we construct order parameters to characterize the dynamics of weight updates in SGD. In the case without mislabeled data, we find that the SGD learning dynamics transitions from a fast learning phase to a slow exploration phase, which is associated with large changes in the order parameters that characterize the alignment of SGD gradients and their mean amplitude. In a more complex case, with randomly mislabeled samples, the SGD learning dynamics falls into four distinct phases. First, the system finds solutions for the correctly labeled samples in phase I; it then wanders around these solutions in phase II until it finds a direction that enables it to learn the mislabeled samples during phase III, after which it finds solutions that satisfy all training samples during phase IV. Correspondingly, the test error decreases during phase I and remains low during phase II; however, it increases during phase III and reaches a high plateau during phase IV. The transitions between different phases can be understood by examining changes in the order parameters that characterize the alignment of the mean gradients for the two datasets (correctly and incorrectly labeled samples) and their (relative) strengths during learning. We find that individual sample losses for the two datasets are separated the most during phase II, leading to a data cleansing process that eliminates mislabeled samples and improves generalization.
Overall, we believe that an approach based on statistical physics and stochastic dynamic systems theory provides a promising framework for describing and understanding learning dynamics in neural networks, which may also lead to more efficient learning algorithms.
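
To make the idea of order parameters built from a minibatch gradient ensemble concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the definitions used in this work: the alignment is taken here to be the mean pairwise cosine similarity between minibatch gradients, the amplitude is taken to be their mean Euclidean norm, and a toy linear regression model stands in for a neural network. The names `minibatch_gradients` and `order_parameters` are hypothetical.

```python
import numpy as np


def minibatch_gradients(w, X, y, batch_size, n_batches, rng):
    """Gradients of the mean squared error of a linear model on random minibatches."""
    grads = []
    for _ in range(n_batches):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grads.append(2.0 * X[idx].T @ residual / batch_size)
    return np.stack(grads)  # shape: (n_batches, n_weights)


def order_parameters(grads):
    """Return (alignment, mean_amplitude) for an ensemble of gradients.

    Assumed definitions, chosen for illustration only:
    alignment      = mean cosine similarity over distinct pairs of gradients
    mean_amplitude = mean Euclidean norm of the gradients
    """
    norms = np.linalg.norm(grads, axis=1)
    unit = grads / np.maximum(norms, 1e-12)[:, None]
    cos = unit @ unit.T
    n = len(grads)
    # Exclude the diagonal (each gradient compared with itself).
    alignment = (cos.sum() - np.trace(cos)) / (n * (n - 1))
    return alignment, norms.mean()


# Toy data: noisy linear targets, no mislabeled samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=512)

w = np.zeros(10)
history = []
for step in range(300):
    grads = minibatch_gradients(w, X, y, batch_size=32, n_batches=20, rng=rng)
    history.append(order_parameters(grads))
    w -= 0.05 * grads[0]  # one SGD step using a single minibatch gradient

early_alignment, early_amplitude = history[0]
late_alignment, late_amplitude = history[-1]
```

In this toy setting, the minibatch gradients are expected to be strongly aligned and large early in training, and weakly aligned and small once the weights fluctuate around a solution, which is loosely analogous to the change from fast learning to slow exploration described above. The same two quantities could be computed separately for correctly and incorrectly labeled subsets of the data, again as an assumption about how such a comparison might be set up.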