The empirical risk minimization (ERM) problem arises in most machine learning tasks, including logistic regression and the training of some neural networks. Stochastic Gradient Descent (SGD) is widely used to solve this problem thanks to its scalability and efficiency on large-scale tasks. Many variants of SGD employ momentum techniques, which incorporate past gradient information into the descent direction. Since momentum methods offer encouraging practical performance, it is desirable to study their theoretical properties and apply that knowledge to algorithm design. In this talk, we give an overview of stochastic momentum methods for the ERM problem and highlight practical algorithms and settings where momentum methods may have theoretical or heuristic advantages over plain SGD.
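As a concrete illustration of the momentum idea described above, the following is a minimal sketch (not any specific method from the talk) of heavy-ball SGD, where a velocity term accumulates past stochastic gradients; the toy least-squares ERM instance, the function names, and the hyperparameter values are all illustrative assumptions:

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.05, beta=0.9, n_steps=500, seed=0):
    """Heavy-ball SGD: v <- beta*v + g;  w <- w - lr*v.

    With beta = 0, this reduces to plain SGD.
    """
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(n_steps):
        g = grad(w, rng)      # stochastic gradient estimate
        v = beta * v + g      # fold past gradient information into the direction
        w = w - lr * v
    return w

# Toy ERM instance (assumed for illustration): least-squares on synthetic data.
rng0 = np.random.default_rng(42)
X = rng0.normal(size=(500, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true                # noiseless labels, so the minimizer is w_true

def minibatch_grad(w, rng, batch=32):
    # Unbiased minibatch gradient of the average squared loss.
    idx = rng.integers(0, X.shape[0], size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

w_hat = sgd_momentum(minibatch_grad, np.zeros(5))
print(np.round(w_hat, 2))
```

On this quadratic toy problem the iterates contract toward `w_true`; the point of the sketch is only the update rule, not a claim about which momentum variant performs best in general.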