In this work we propose an asynchronous, GPU-based implementation of the widely-used stochastic coordinate descent algorithm for convex optimization. We define the class of problems that can be solved using this method, and show that it includes many popular machine learning applications. For three such applications, we demonstrate at least a 10× training speed-up relative to a state-of-the-art implementation that uses all available resources of a modern CPU. In order to train on very large datasets that do not fit inside the memory of a single GPU, we then consider techniques for distributed learning. We show that while such techniques do not necessarily allow one to achieve further speed-up, they do allow one to train on datasets that would otherwise not fit into memory. We thus propose a distributed learning system that uses the synchronous CoCoA framework to distribute the global optimization across GPUs, and our novel asynchronous algorithm to solve the corresponding local optimizations within each GPU. We benchmark such a system using a 200 GB dataset that consists of 1 billion training examples. We show by scaling out across 16 GPUs, we can train an SVM model to a high degree of accuracy in around 1 min: a 15× speed-up in training time compared to a state-of-the-art CPU-based implementation that uses 640 threads distributed across 8 CPUs.