A version of this post originally appeared in the IBM Data and AI blog.
Training machine learning (ML) models can take a long time, depending on your dataset and available hardware, and that keeps you from experimenting quickly. That can be a real problem for a data scientist on a deadline, and at the very least it’s a pain to sit and wait for results. Snap ML is a library that helps address that pain. As a drop-in replacement for scikit-learn, it’s particularly easy to use. Snap ML accelerates the training and inference of some of the most popular ML models (according to Kaggle’s 'State of Data Science and Machine Learning 2020') and blends in seamlessly with scikit-learn operators for data pre-processing and feature engineering, using a familiar scikit-learn API.
Python is the programming language of choice for many data scientists because of its wide range of libraries and strong support from its vast community of developers. It’s a particularly powerful language for open-source data stacks, but, by design, it is not built for fast code execution.
Python’s popularity and ease of use inspired IBM Research to help data scientists working in a Python stack by creating Snap ML, which they designed to optimize for speed and efficiency. Snap ML is a free-to-use software library that you can install right now to shorten training and inference time for your ML models, compared to the typical performance of scikit-learn, the generally well-loved standard among ML APIs.
This notebook is an example of how to use Snap ML to shorten your training time using a local CPU. We’ll later show other examples of using Snap ML to improve performance at inference time, and compare the performance of Snap ML on CPU vs. GPU.
Random Forest Credit Card Fraud Class
As you can see, the library intentionally mirrors scikit-learn’s design and fits into the same workflow. That should make it easy for data scientists to shorten their model development lifecycle by improving training time (and, in later posts, inference time).
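To make the drop-in idea concrete, here is a minimal sketch of timing the same random forest in scikit-learn and in Snap ML on a local CPU. It is not the notebook's code: the data is a synthetic, imbalanced stand-in generated with `make_classification` (not a real fraud dataset), the helper names `time_fit` and `compare_fit_times` are hypothetical, and it assumes Snap ML's `RandomForestClassifier` accepts the same `n_estimators`, `n_jobs`, and `random_state` arguments as its scikit-learn counterpart.

```python
import time


def time_fit(estimator, X, y):
    """Return the wall-clock seconds spent in estimator.fit(X, y)."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


def compare_fit_times(estimators, X, y):
    """Fit each named estimator on the same data and return {name: seconds}."""
    return {name: time_fit(est, X, y) for name, est in estimators.items()}


if __name__ == "__main__":
    try:
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier as SklearnRF
        from snapml import RandomForestClassifier as SnapRF
    except ImportError as exc:
        print(f"Skipping demo, missing dependency: {exc.name}")
    else:
        # Synthetic, imbalanced stand-in for a fraud dataset (illustrative only).
        X, y = make_classification(
            n_samples=20_000, n_features=30, weights=[0.98], random_state=0
        )
        # Assumption: both classes take the same constructor arguments here.
        models = {
            "scikit-learn": SklearnRF(n_estimators=100, n_jobs=4, random_state=0),
            "Snap ML": SnapRF(n_estimators=100, n_jobs=4, random_state=0),
        }
        for name, seconds in compare_fit_times(models, X, y).items():
            print(f"{name}: {seconds:.2f} s")
```

Only the import line differs between the two models, which is the point of a drop-in replacement. Actual timings depend on your hardware, dataset, and settings, so treat any single run as a rough indication.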
The library is distributed for free, but it is currently not open source because IBM uses it in our products. It’s a great way to obtain a large increase in productivity and shorter workloads in products like IBM Watson Studio, IBM Watson Machine Learning, IBM Cloud Pak for Data, and our IBM Watson Machine Learning Accelerator. As an example, when you use AutoAI with Watson Studio to automatically generate ML pipelines and Jupyter Notebooks, part of the reason they execute so quickly is that Snap ML is embedded in our products. In an enterprise scenario for building AI solutions, a group of data scientists could schedule and accelerate their ML workloads with Snap ML on a GPU grid with Watson Machine Learning Accelerator.
Snap ML is ready for you to use right now at no cost. You can install it through PyPI and see the difference in the time it takes to train your first model. If you’re curious about how to use it most efficiently, reach out on the IBM Data Science Community, or sign up for a Watson Studio with AutoAI trial today.
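As a small sketch of the install step, the snippet below checks whether Snap ML is importable in the current environment and, if not, prints the install command. It assumes the package is published on PyPI under the name `snapml`; the helper `is_installed` is a hypothetical name used for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("snapml"):
    print("Snap ML is available in this environment.")
else:
    # Assumption: the PyPI package name is "snapml".
    print("Snap ML not found; try: pip install snapml")
```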
Thanks to Haris Pozidis (IBM Research) and Kelvin Lui for their example notebooks and edits. Thanks to Andreea Anghel (IBM Research) for her contributions. Credit to Jana Thompson for edits.