A new deep learning design approach could help businesses and non-experts to write their own AI solutions.
Artificial intelligence has the potential to greatly simplify our lives – but not everyone is a data scientist and not all data scientists are experts in machine learning. Enter AI for AI – a novel approach to designing, training and optimizing machine learning models automatically. With AI for AI, anyone could soon build machine learning pipelines directly from raw data, without writing complex code or performing tedious tuning and optimization, and then use them to automate complicated, labor-intensive tasks.
Several IBM papers selected for the AAAI-20 conference in New York demonstrate the value of AI for AI and different approaches to it in great detail.
Most AI for AI research currently focuses on three areas: automatically determining the best models for each step of the desired machine learning and data science pipeline (model selection), automatically finding the best architecture of a deep learning-based AI model, and automatically finding the best hyperparameters (parameters for the model training process) for AI models and algorithms.
One work, led by Sijia Liu and Parikshit Ram of the MIT-IBM Watson AI Lab and IBM Research AI, details a new approach to automated data science [1]. The team describes an AI for AI system that, given a new dataset, has to decide which machine learning model is the most appropriate. It also has to determine the hyperparameters of that model (such as the learning rate) to achieve the highest accuracy. In machine learning, this task is referred to as the CASH problem – short for Combined Algorithm Selection and Hyperparameter tuning, a complex joint optimization problem over a hierarchical parameter space.
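For readers who prefer notation, one common way to write the CASH problem is sketched below. The symbols are assumptions introduced here for illustration and do not come from the paper: $\mathcal{A}$ is the set of candidate algorithms, $\Lambda^{(j)}$ is the hyperparameter space of algorithm $A^{(j)}$, and $\mathcal{L}$ is the validation loss of a model trained on $D_{\text{train}}$ and scored on $D_{\text{valid}}$.

```latex
\[
  \left(A^{\star}, \lambda^{\star}\right) \in
  \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
  \mathcal{L}\!\left(A^{(j)}_{\lambda},\, D_{\text{train}},\, D_{\text{valid}}\right)
\]
```

The space is hierarchical because the set of hyperparameters $\lambda$ that exists at all depends on which algorithm $A^{(j)}$ has been chosen.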
Finding the truly ‘best’ pipeline
Typically, solutions to CASH include random search and bandit-based approaches such as successive halving and Hyperband. They all rely on random sampling of models and hyperparameter configurations, with both the model and its hyperparameters drawn from a uniform probability distribution.
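To make the bandit-based idea concrete, here is a minimal sketch of successive halving with uniform sampling. Everything in it is an assumption made for illustration: the two-model search space, the `evaluate` function (a toy stand-in for training a model under a given budget), and the parameter names `n_configs`, `min_budget` and `eta`.

```python
import random


def sample_config(rng):
    """Draw a model, then its hyperparameters, uniformly at random (hypothetical space)."""
    model = rng.choice(["logreg", "tree"])
    if model == "logreg":
        return {"model": model, "C": rng.uniform(0.01, 10.0)}
    return {"model": model, "max_depth": rng.randint(1, 12)}


def evaluate(config, budget, rng):
    """Toy stand-in for 'train with this budget, return validation accuracy'."""
    if config["model"] == "logreg":
        quality = 0.80 - 0.02 * abs(config["C"] - 1.0)
    else:
        quality = 0.85 - 0.03 * abs(config["max_depth"] - 6)
    noise = rng.gauss(0.0, 0.05 / budget)  # larger budgets give less noisy estimates
    return quality + noise


def successive_halving(n_configs=16, min_budget=1, eta=2, seed=0):
    """Keep the top 1/eta of configurations each round while multiplying the budget by eta."""
    rng = random.Random(seed)
    survivors = [sample_config(rng) for _ in range(n_configs)]
    budget = min_budget
    while len(survivors) > 1:
        scored = [(evaluate(c, budget, rng), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
    return survivors[0]


best = successive_halving()
print(best)
```

Cheap, noisy evaluations weed out most candidates early, and only the survivors receive the larger budgets. Hyperband builds on this by running several such brackets with different trade-offs between the number of configurations and the starting budget.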
In this work, the researchers propose to solve CASH using a direct operator splitting optimization method. Their search process automatically determines the sequence of data preparation steps – such as taking the logarithm of a column, binning a column, removing outliers, creating new features, and so on. The process also performs modeling choices at the end of a data science pipeline and considers both data preparation and modeling choices jointly in a single rigorously formulated optimization problem.
The group has come up with a way to determine the best choice of model that depends on the selection of data preparation steps, and vice versa – getting the best data preparation depending on the kind of model that follows it. The method decomposes a high-dimension complex AI for AI problem into easily solvable and low-dimension sub-problems. Treating the search this way helps to find the best pipeline, says Alex Gray, VP, Foundations of AI at IBM’s lab in Yorktown and one of the co-authors. “Otherwise, the initial steps chosen in the pipeline limit the subsequent choices to a smaller set, which in general may not be the best ones,” he adds.
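Gray's point about early choices limiting later ones can be illustrated with a toy comparison. The accuracy table below is entirely hypothetical, and the joint search here simply enumerates every combination, which is only feasible because the example is tiny; the paper uses ADMM precisely because real pipelines are far too large to enumerate.

```python
from itertools import product

# Hypothetical validation accuracies for (data preparation step, model) pairs.
SCORES = {
    ("none", "linear"): 0.78,
    ("none", "tree"): 0.84,
    ("log_transform", "linear"): 0.90,
    ("log_transform", "tree"): 0.83,
    ("binning", "linear"): 0.80,
    ("binning", "tree"): 0.86,
}
PREPS = ["none", "log_transform", "binning"]
MODELS = ["linear", "tree"]


def greedy_search(default_model="tree"):
    """Fix the preparation step first (judged with a default model), then pick the model."""
    prep = max(PREPS, key=lambda p: SCORES[(p, default_model)])
    model = max(MODELS, key=lambda m: SCORES[(prep, m)])
    return prep, model


def joint_search():
    """Consider preparation and model together and return the best pair overall."""
    return max(product(PREPS, MODELS), key=lambda pair: SCORES[pair])


greedy = greedy_search()
joint = joint_search()
print("greedy:", greedy, SCORES[greedy])
print("joint: ", joint, SCORES[joint])
```

In this made-up table the greedy search commits to binning because it looks best with the default tree model, and so never discovers that a log transform followed by a linear model scores higher.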
But this more ‘correct’ or optimal formulation of the search for best pipelines entails higher computational cost if not done well. The researchers address this challenge with a direct optimization method based on the so-called ADMM (Alternating Direction Method of Multipliers) paradigm that allows the definition of ‘best’ to be much richer than the usual choice of simply maximizing predictive accuracy. That’s because in real-world enterprise data science, the ‘best’ pipeline should maximize predictive accuracy but may also need to obey constraints on the maximum prediction time allowed by the application or obey fairness constraints in regulated industries such as lending. “This approach allows nearly arbitrary constraints to be expressed, while still providing an efficient and rigorous optimization,” says Gray.
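As a rough sketch of what this looks like on paper, the constrained search can be written as below, followed by the textbook scaled-form ADMM updates for a generic split problem. The notation is an assumption made here for illustration: $p$ is a pipeline configuration, $\ell$ its validation loss, and each $g_i$ a constraint metric such as prediction time or a fairness measure with threshold $\epsilon_i$. The paper's actual splitting is more involved than this generic template.

```latex
% Constrained pipeline search (illustrative notation)
\[
  \min_{p \in \mathcal{P}} \; \ell(p)
  \quad \text{subject to} \quad g_i(p) \le \epsilon_i, \qquad i = 1, \dots, m
\]

% Generic ADMM template for  min_x f(x) + g(z)  subject to  x = z
\begin{align*}
  x^{k+1} &= \operatorname*{arg\,min}_{x} \; f(x) + \tfrac{\rho}{2}\,\lVert x - z^{k} + u^{k} \rVert_2^2 \\
  z^{k+1} &= \operatorname*{arg\,min}_{z} \; g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^{k} \rVert_2^2 \\
  u^{k+1} &= u^{k} + x^{k+1} - z^{k+1}
\end{align*}
```

The appeal of the template is that each update touches only one block of variables, so a large joint problem is replaced by a sequence of smaller sub-problems that are easier to solve.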
Liu adds that the approach is important “due to its flexibility, as we are using the existing AI for AI techniques, its effectiveness as compared to open source AI for AI toolkits, and its unique capability of our proposed scheme on a collection of binary classification data sets from UCI ML & OpenML repositories.”
Weighted sampling and multi-armed bandits
Another IBM team has published a paper that also tries to solve the CASH problem but in a very different way. While Liu’s team suggests solving CASH with a direct optimization approach (ADMM), a paper led by AI scientist Dimitrios Sarigiannis tries to improve random search and bandit-based methods by altering the sampling distribution [2]. Bandit-based methods stem from the multi-armed bandit problem, in which a gambler faces a slot machine with n arms (a single-armed machine is sometimes called a one-armed bandit), each arm having its own rigged probability distribution of success.
Sarigiannis and colleagues have developed an approach dubbed “weighted sampling for combined model selection and hyperparameter tuning” that treats two types of choices jointly. They choose the best model class and the best hyperparameters for it, while Liu’s work examines every possible choice in the data science pipeline jointly, including those two.
The weighted sampling paper proposes an alternative probability distribution, where models with more hyperparameters are weighted so that they are sampled exponentially more frequently, enabling a more fine-grained exploration of the optimization space. “We show that this weighted sampling distribution has a strictly higher probability of identifying the optimal solution to CASH in a theoretical setting,” says co-author Thomas Parnell at the IBM Research lab in Zurich, Switzerland. “We also present experimental results across 67 tabular datasets and show that weighted sampling can enhance three different state-of-the-art CASH methods, achieving a statistically significant accuracy improvement for an equivalent training budget.”
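The difference between uniform and weighted sampling is easy to see in a small sketch. The model classes, their hyperparameter counts, and the weighting rule `base ** k` are assumptions chosen for illustration; the article only says that models with more hyperparameters are sampled exponentially more often, and the exact weights used in the paper may differ.

```python
import random
from collections import Counter

# Hypothetical model classes and their number of tunable hyperparameters.
N_HYPERPARAMS = {
    "naive_bayes": 0,
    "logreg": 1,
    "random_forest": 3,
    "gradient_boosting": 5,
}


def sampling_probabilities(n_hyperparams, base=2.0):
    """Weight each model by base ** (number of hyperparameters); base=1.0 is uniform."""
    weights = {model: base ** k for model, k in n_hyperparams.items()}
    total = sum(weights.values())
    return {model: w / total for model, w in weights.items()}


def sample_models(probs, n_draws, seed=0):
    """Draw model classes according to the given probabilities."""
    rng = random.Random(seed)
    names = list(probs)
    return rng.choices(names, weights=[probs[m] for m in names], k=n_draws)


uniform = sampling_probabilities(N_HYPERPARAMS, base=1.0)
weighted = sampling_probabilities(N_HYPERPARAMS, base=2.0)
counts = Counter(sample_models(weighted, n_draws=1000))

print("uniform: ", uniform)
print("weighted:", weighted)
print("draws:   ", counts)
```

Under the weighted scheme, the model with five hyperparameters receives most of the draws, so its larger configuration space is explored more finely for the same overall budget.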
The applications are very broad, including in finance, retail and healthcare, says Parnell – as the technology can be readily integrated into any AI for AI system that uses random search or bandit techniques for model selection and hyperparameter tuning.
Of course, a full AI for AI system needs to do much more than just model selection and hyperparameter tuning. There are many other components in a typical machine learning pipeline, such as data cleaning, data pre-processing, feature engineering, and even ensemble building. So, says Parnell, the next step would be to see whether this “weighted sampling” approach could be adapted to optimizing the complete pipeline.
Dynamics of graphs for behavior prediction
Not all AI for AI papers can be applied today, though: some are still early in addressing complex data types like graphs. For instance, Aldo Pareja, Giacomo Domeniconi and colleagues from IBM and MIT have developed EvolveGCN, a novel deep neural network model for learning the dynamics of graphs that evolve over time – to then more accurately predict future graph properties and structures [3]. EvolveGCN adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings.
The technology could be used in any applications that involve dynamic graphs, for instance social network analysis where users and mutual friendships emerge and disappear frequently. Or it could be used for financial forensics, where transactions between accounts are dynamic and the nature of a user account may change over time – say, an account involved in money laundering or a user who’s a victim of credit card fraud.
Currently, though, there is no known AI for AI framework able to work with graph data as most address tabular datasets, images or textual data. But future AI for AI systems should be able to work with a wide variety of input data structures, including graphs, says Parnell, who was not involved in this work. After all, most machine learning models are deep neural networks in the form of graphs. Getting the best graph produces a dynamic graph sequence, and the dynamics learned during this process could potentially help to design more efficient optimization procedures and improve AI for AI efficiency.
The method overcomes the drawbacks of previous models, says team lead and co-author Jie Chen at the MIT-IBM Watson AI Lab – because it can handle frequent appearance and disappearance of graph nodes. “The model achieves better prediction accuracy than prior models that artificially fill in nonexistent nodes,” he says. The next step, Chen adds, would be scaling the model to large graphs – but it’ll be tricky because of the high computational cost. “We are considering applying our prior work on FastGCN [4] to tackle the challenge in the graph size, and replacing recurrent models by attention models to tackle the challenge in the temporal dimension,” he says.
Finally, the fourth paper also presented at AAAI is about building calibrated deep models via uncertainty matching with auxiliary interval predictors [5]. Here, the researchers are treating the problem of giving neural networks accurate uncertainties around their predictions. “The uncertainty estimation is definitely something one would expect from the ideal AI for AI system,” says Parnell. “In some applications, such as healthcare, it is very important to be able to quantify how certain the resulting machine learning pipeline is about its prediction.” This work could potentially provide better individual modeling methods or more powerful components for AI for AI – in addition to the usual methods such as vanilla neural networks.
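What “accurate uncertainties” means can be checked with a simple measurement: if a model claims its intervals cover 90 percent of outcomes, roughly 90 percent of the true values should fall inside them. The sketch below only performs that check and is not the paper's uncertainty matching method; the data, the fixed interval half-width, and the function names are all made up for illustration.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside their predicted interval."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


def calibration_gap(y_true, lower, upper, nominal=0.9):
    """Positive: intervals are wider than claimed. Negative: the model is overconfident."""
    return empirical_coverage(y_true, lower, upper) - nominal


# Hypothetical targets and point predictions with a fixed interval half-width.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
pred = [1.2, 1.9, 3.4, 3.4, 5.1, 6.4, 6.0, 8.2, 9.1, 9.7]
half_width = 0.5
lower = [p - half_width for p in pred]
upper = [p + half_width for p in pred]

coverage = empirical_coverage(y_true, lower, upper)
gap = calibration_gap(y_true, lower, upper, nominal=0.9)
print(f"coverage={coverage:.2f}, gap={gap:+.2f}")
```

In this toy data the intervals contain 8 of the 10 true values, so a model claiming 90 percent coverage would be overconfident and its intervals would need to widen.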
Current AI for AI approaches are still very time consuming and not very scalable, and there’s a lot of room for improvement. But this is what research is all about – because when we get there, AI for AI could have significant business impact.
“You can imagine AI for AI enabling non-experts to build and use machine learning systems for handling targeted tasks, such as data analysis, image classification, video detection, and so on,” says Liu. “After all, any user can use a camera to take photos without necessarily knowing how the camera actually creates the photos. The same with AI for AI: any user can build an AI model without knowing the inner mechanism of how an AI block really works, and that’s OK.”
1. Liu, S. et al. An ADMM Based Framework for AutoML Pipeline Configuration. AAAI 34, 4892–4899 (2020).
2. Sarigiannis, D., Parnell, T. & Pozidis, H. Weighted Sampling for Combined Model Selection and Hyperparameter Tuning. AAAI 34, 5595–5603 (2020).
3. Pareja, A. et al. EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs. AAAI 34, 5363–5370 (2020).
4. Chen, J., Ma, T. & Xiao, C. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. ICLR 2018 (2018).
5. Thiagarajan, J. J., Venkatesh, B., Sattigeri, P. & Bremer, P.-T. Building Calibrated Deep Models via Uncertainty Matching with Auxiliary Interval Predictors. AAAI 34, 6005–6012 (2020).