Stochastic gradient descent (SGD) is widely used by many machine learning algorithms. It is efficient for big-data applications due to its low per-update cost. However, SGD is inherently sequential, and parallelizing it is nontrivial; doing so efficiently on many-core architectures such as GPUs remains a major challenge. In this paper, we present cuMFSGD, a parallelized SGD solution for matrix factorization on GPUs. We first design high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism. We then design efficient schemes that parallelize SGD updates across ratings by exploiting data parallelism. Finally, we scale cuMFSGD to large data sets that cannot fit into one GPU's memory. Evaluations on three public data sets show that cuMFSGD, using only one GPU card, outperforms existing solutions, including one running on a 64-node CPU system, by a large margin.
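To make the abstract concrete, the following is a minimal CPU sketch of the per-rating SGD update for matrix factorization that cuMFSGD parallelizes on the GPU. It is illustrative only, not the paper's kernels; the function name `sgd_mf` and the hyperparameters (`lr`, `reg`, `epochs`) are assumptions chosen for the example.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02,
           epochs=500, seed=0):
    """Factorize a sparse rating list into factors P (users) and Q (items).

    ratings: iterable of (user, item, value) triples.
    Each observed rating triggers one SGD update of the two factor
    vectors it touches -- the fine-grained update that a GPU
    implementation would accelerate. Hypothetical illustrative code,
    not the cuMFSGD kernels.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()          # snapshot so both updates use old values
            err = r - pu @ Q[i]       # prediction error for this rating
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Because each update touches only one user vector and one item vector, ratings that share no user or item can be updated concurrently; this is the data parallelism the abstract refers to.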