Classical ML techniques like those offered by AutoAI are very effective, particularly when applied to tabular datasets. According to the Kaggle State of Data Science survey, methods such as logistic regression, random forests and boosted trees remain the most frequently used ML algorithms in industry.
Our team designs algorithms for training models better suited to the underlying systems on which they run—be that in a cloud instance with a handful of vCPUs or a powerful on-prem server with a large number of CPUs and GPUs.
Using these smart algorithms, Snap ML can achieve significantly faster training and inference in both cloud and on-prem compute environments, without sacrificing any model accuracy. And by benchmarking this new version of AutoAI across a collection of large tabular datasets (from Kaggle), we’ve demonstrated that this integration effort resulted in a 4x faster runtime (on average). Our research in this area has been published at top conferences such as NeurIPS, ICML and AAAI, and it has helped to define the state of the art in classical ML algorithms. For more, watch our talk on Snap ML at NeurIPS 2021.
AutoAI searches for the best pipeline using a complex optimization algorithm. It involves repeated training of different machine learning models with various hyper-parameter configurations and feature engineering schemes. The machine learning models it uses as components are provided by a variety of different Python software frameworks including scikit-learn, XGBoost and LightGBM.
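To give a flavor of this kind of search, here is a minimal sketch using scikit-learn alone. The toy search space and model names below are our own illustration, not AutoAI's actual optimization algorithm or search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small synthetic tabular dataset standing in for real customer data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Toy search space: a few model / hyper-parameter combinations.
candidates = [
    ("logreg_C=0.1", LogisticRegression(C=0.1, max_iter=1000)),
    ("logreg_C=1.0", LogisticRegression(C=1.0, max_iter=1000)),
    ("rf_50_trees", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("rf_200_trees", RandomForestClassifier(n_estimators=200, random_state=0)),
]

# Evaluate each candidate with cross-validation and keep the best.
results = {name: cross_val_score(model, X, y, cv=3).mean()
           for name, model in candidates}
best = max(results, key=results.get)
print(best, round(results[best], 3))
```

In practice AutoAI explores a far larger space, interleaving feature engineering and hyper-parameter optimization, but the core loop of training many candidates and ranking them is the same.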
By enhancing AutoAI to consume models from Snap ML, alongside these other frameworks, our goal was to pass on Snap ML's speed-ups to AutoAI users and deliver end-to-end acceleration of the search for the best pipeline.
Before the Snap ML integration, AutoAI always chose the ML models that gave the highest accuracy. This often resulted in situations where two models, say a random forest from scikit-learn and a random forest from Snap ML, achieved identical or very similar accuracy. Despite the Snap ML model being dramatically faster, there was no guarantee that AutoAI would pick it.
To ensure that AutoAI picked the fast Snap ML model over the equivalent scikit-learn model, we needed a new criterion for selecting models inside AutoAI. We therefore developed a method that picks models based on both accuracy and runtime, ensuring that models that are both accurate and fast get selected and delivered to the customer.
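One simple way to realize such a criterion is to break accuracy ties in favor of speed: among all candidates whose accuracy is within a small tolerance of the best, pick the fastest. The sketch below illustrates this idea; the tolerance value and the candidate numbers are illustrative assumptions, not AutoAI's actual implementation:

```python
def select_model(candidates, tolerance=0.005):
    """Pick the fastest model among those whose accuracy is
    within `tolerance` of the best accuracy observed.

    candidates: list of (name, accuracy, train_seconds) tuples.
    """
    best_acc = max(acc for _, acc, _ in candidates)
    close_enough = [c for c in candidates if c[1] >= best_acc - tolerance]
    return min(close_enough, key=lambda c: c[2])

# Illustrative numbers: two near-identical forests, one much faster.
candidates = [
    ("sklearn RandomForest", 0.912, 120.0),
    ("snapml RandomForest",  0.911,  30.0),
]
print(select_model(candidates)[0])  # -> snapml RandomForest
```

With a pure accuracy criterion, the scikit-learn model would win by 0.001; with the tolerance-based rule, the much faster Snap ML model is selected instead.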
This acceleration can drive entirely new use cases of AutoAI, such as performing pipeline search on large datasets, or across a large collection of smaller datasets.
AutoAI gives our customers the option to export the learned ML pipeline to a Jupyter notebook, which can be inspected, re-trained or deployed on Linux/x86, Linux/IBM Power, Windows and macOS. It was crucial to make the installation process of Snap ML on all these platforms as straightforward as possible.
For that, we created binary Python packages known as “wheels” for all these platforms and deployed them to the Python package index (PyPI). As a result, anyone can now install Snap ML just by using “pip install snapml.”
Our team also provides binary packages for IBM Power and IBM Z systems, which has massively expanded the reach of Snap ML, independently of AutoAI. As of October 1, 2021, Snap ML has been downloaded 49,825 times from PyPI.
Get started with Snap ML with examples and tutorials on GitHub.