Publication
Future Generation Computer Systems
Paper
Tera-scale coordinate descent on GPUs
Abstract
In this work we propose an asynchronous, GPU-based implementation of the widely-used stochastic coordinate descent algorithm for convex optimization. We define the class of problems that can be solved using this method, and show that it includes many popular machine learning applications. For three such applications, we demonstrate at least a 10× training speed-up relative to a state-of-the-art implementation that uses all available resources of a modern CPU. In order to train on very large datasets that do not fit inside the memory of a single GPU, we then consider techniques for distributed learning. We show that while such techniques do not necessarily allow one to achieve further speed-up, they do allow one to train on datasets that would otherwise not fit into memory. We thus propose a distributed learning system that uses the synchronous CoCoA framework to distribute the global optimization across GPUs, and our novel asynchronous algorithm to solve the corresponding local optimizations within each GPU. We benchmark such a system using a 200 GB dataset that consists of 1 billion training examples. We show that by scaling out across 16 GPUs, we can train an SVM model to a high degree of accuracy in around one minute: a 15× speed-up in training time compared to a state-of-the-art CPU-based implementation that uses 640 threads distributed across 8 CPUs.