On Adam Trained Models and a Parallel Method to Improve the Generalization Performance
Abstract
Adam is a popular stochastic optimizer that uses adaptive estimates of lower-order moments to update weights and requires little hyper-parameter tuning. Some recent studies have called the generalization and out-of-sample behavior of such adaptive gradient methods into question, and have argued that such methods are of only marginal value. Notably, for many well-known image classification tasks such as CIFAR-10 and ImageNet-1K, the models with the best validation performance are still trained with SGD and a manual schedule of learning rate reduction. We analyze Adam-trained and SGD-trained models for 7 popular neural network architectures on image classification using the CIFAR-10 dataset. Visualization shows that, when classifying, Adam-trained models frequently 'focus' on areas of the image not occupied by the object to be classified. Weight statistics reveal that Adam-trained models have larger weights and larger L2 norms than SGD-trained ones. Our experiments show that applying weight decay and reducing the initial learning rate both improve the generalization performance of Adam, but a gap remains between Adam-trained and SGD-trained models. To bridge this generalization gap, we adopt a K-step model averaging parallel algorithm with the Adam optimizer. Because communication is very sparse, the algorithm achieves high parallel efficiency. Across the 7 models, the average improvement in validation accuracy over SGD is 0.72%, and the average parallel speedup is 2.5 with 6 GPUs.
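To make the K-step model averaging idea concrete, the following is a minimal single-process sketch, not the implementation used in the experiments. It simulates several workers that each run K local Adam steps from a shared starting point and then average their weights. The toy quadratic objective, the helper names (adam_step, noisy_grad, k_step_averaging, l2_norm), the hyper-parameter values, and the choice to average only the weights while keeping each worker's Adam moment estimates local are all assumptions made for illustration.

```python
import random


def adam_step(w, g, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one standard Adam update to a list of floats.

    `state` holds the first/second moment estimates (m, v) and step count t.
    """
    state["t"] += 1
    t = state["t"]
    new_w = []
    for i, (wi, gi) in enumerate(zip(w, g)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * gi
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * gi * gi
        m_hat = state["m"][i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - b2 ** t)  # bias-corrected second moment
        new_w.append(wi - lr * m_hat / (v_hat ** 0.5 + eps))
    return new_w


def noisy_grad(w, target, rng, noise=0.1):
    """Gradient of 0.5 * ||w - target||^2 plus Gaussian noise.

    The noise is a stand-in for mini-batch sampling (toy assumption).
    """
    return [wi - ti + rng.gauss(0.0, noise) for wi, ti in zip(w, target)]


def k_step_averaging(target, num_workers=4, k=8, rounds=100, lr=0.05, seed=0):
    """Simulate K-step model averaging with one Adam state per worker."""
    rng = random.Random(seed)
    dim = len(target)
    global_w = [0.0] * dim
    # Assumption: moment estimates stay local; only weights are averaged.
    states = [
        {"m": [0.0] * dim, "v": [0.0] * dim, "t": 0}
        for _ in range(num_workers)
    ]
    for _ in range(rounds):
        local_ws = []
        for state in states:
            w = list(global_w)  # every worker starts from the averaged model
            for _ in range(k):  # K local steps with no communication
                w = adam_step(w, noisy_grad(w, target, rng), state, lr=lr)
            local_ws.append(w)
        # The only communication per round: average the worker weights.
        global_w = [
            sum(ws[i] for ws in local_ws) / num_workers for i in range(dim)
        ]
    return global_w


def l2_norm(w):
    """L2 norm of a weight vector, the statistic compared across optimizers."""
    return sum(x * x for x in w) ** 0.5


if __name__ == "__main__":
    final_w = k_step_averaging([1.0, -2.0, 0.5])
    print("averaged weights:", final_w)
    print("L2 norm:", l2_norm(final_w))
```

In this sketch the workers communicate once every K local steps rather than after every gradient computation, which is the sense in which communication is sparse. In a real multi-GPU setting the averaging line would be replaced by a collective operation across devices.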