Testing for AI


Many industries have centered their business development on AI-based innovation. However, broad adoption of AI systems hinges on trust in their output, and a natural question is how to ensure this trustworthiness. For traditional non-AI applications, industrial practice dedicates phases and personnel to testing in the software development lifecycle to ensure reliability. In this post, I discuss the importance of and challenges in AI model testing, along with some of IBM Research’s efforts toward ensuring AI trustworthiness.

Types of Testing in AI Lifecycle

At a high level, the Data and AI Lifecycle contains three phases:

  • The first phase is the Data Lifecycle where data is pre-processed and made ready for building an AI model.

  • The second phase is the model building phase where data scientists try to produce the best model.

  • The third phase is the post-model building phase, where the model is validated from a business perspective before it is deployed, monitored, and retrained.

Post model building, the data scientist uses the validation data to iteratively strengthen the model, and then selects one model from multiple candidates based on their performance on the hold-out test data. Moreover, in some regulated industries, an auditor or a risk manager further performs model validation or a risk assessment of the model. Once deployed, the model is continuously monitored with the payload data to check its runtime performance.  
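The selection step above can be sketched as a simple hold-out comparison. The following is a minimal illustration in pure Python; the candidate "models" are hypothetical single-feature threshold classifiers, not a real training pipeline:

```python
# Illustrative sketch: select one model from several candidates by
# hold-out accuracy. All models and data here are hypothetical.

def make_threshold_model(threshold):
    """Return a classifier that predicts 1 when the feature exceeds threshold."""
    return lambda x: 1 if x > threshold else 0

def accuracy(model, data):
    """Fraction of (feature, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Hold-out test data, kept aside from training: (feature, label) pairs.
holdout = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

# Candidate models produced during the iterative build/validate loop.
candidates = [make_threshold_model(t) for t in (0.3, 0.5, 0.8)]

# Select the candidate with the best hold-out performance.
best = max(candidates, key=lambda m: accuracy(m, holdout))
```

In a real pipeline the candidates would come from different algorithms or hyperparameter settings, but the selection criterion is the same: performance on data the models never saw during training.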

Each of these four steps brings unique challenges to model testing. The first step requires that data scientists understand the reason for test failures and repair the model, either through hyperparameter tuning or by changing the training data; this calls for sophisticated techniques to localize faults in the data or the model. Testing with the hold-out test data does not require debugging, but it needs comprehensive test data to compare multiple models and the capability to understand the behavioural differences between them. The risk assessment step needs unseen test samples without the additional burden of labelling the data. The monitoring phase needs to pinpoint model failures to production data drift or to particular characteristics of the trained model.  
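As a concrete illustration of the monitoring challenge, one common way to detect production data drift is the Population Stability Index (PSI), which compares the distribution of a feature in the payload data against its training distribution. This is a minimal sketch for a single numeric feature, not the method described in this post; all data is made up:

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between a training sample (expected)
    and a payload sample (actual) for one numeric feature.
    Larger values indicate stronger drift."""
    lo = min(expected + actual)
    hi = max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        count = sum(1 for v in sample
                    if lo + b * width <= v < lo + (b + 1) * width
                    or (b == bins - 1 and v == hi))
        # Small floor avoids log(0) for empty bins.
        return max(count / len(sample), 1e-6)

    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))

train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]    # training distribution
same  = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.8]  # similar payload
drift = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.05]     # shifted payload
```

A monitoring system would compute such a statistic per feature on each payload batch and raise an alert when it crosses a threshold, which is the first step in attributing a failure to data drift rather than to the model itself.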

Performing testing across multiple modalities such as tabular, time-series, NLP, image, and speech, and for multiple testing properties such as generalizability, fairness, robustness, interpretability, and business KPIs, is a daunting challenge that our work tries to address. Except for generalizability, all of these properties are metamorphic, which means that labelling the test data is not required. For example, to check the robustness of a model, one can create two very similar data instances and check whether the model returns the same prediction for both. This paves the way for synthetic test data generation, which is a requirement in the risk assessment phase and alleviates other problems of the train-test split:

  • Firstly, test data obtained from the training data split can be limited in size. This is an issue for testing properties like group fairness, which needs sufficient representation of the entire data distribution.

  • Secondly, the test data may not contain enough variation in its samples. For example, a chatbot can predict a different intent for a semantics-preserving variation of a training instance.

  • Thirdly, most model failures occur because the distribution of the production workload differs from that of the training data.

We address these challenges by generating synthetic test data with different characteristics based on the user's choice: 1) realistic yet different from the training data, 2) user-customizable to generate distributions and variations that differ from the training data, and 3) even drawn from low-density regions, specifically for tabular data. 
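The metamorphic robustness check described above can be sketched as follows. The model, instances, and perturbation scheme here are all hypothetical, chosen only to illustrate the idea that prediction consistency can be tested without any ground-truth labels:

```python
import random

random.seed(0)  # make the sketch reproducible

def robustness_test(model, instance, n_trials=100, eps=0.01):
    """Metamorphic robustness check: perturb each numeric feature by a
    small random amount and record every perturbed instance for which
    the model's prediction changes. No ground-truth label is needed --
    only prediction consistency between near-identical inputs."""
    original = model(instance)
    failures = []
    for _ in range(n_trials):
        perturbed = [x + random.uniform(-eps, eps) for x in instance]
        if model(perturbed) != original:
            failures.append(perturbed)
    return failures

# Hypothetical model: a step function, unstable near x[0] == 0.5.
model = lambda x: int(x[0] > 0.5)

stable  = robustness_test(model, [0.9, 0.1])    # far from the boundary
fragile = robustness_test(model, [0.501, 0.1])  # right at the boundary
```

The same metamorphic pattern generalizes to the other properties mentioned above: for individual fairness, the "similar" pair differs only in a protected attribute; for NLP robustness, it is a semantics-preserving rephrasing.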

Generating effective test cases that can reveal issues with a model, and mitigating such issues, poses significant technical challenges across the various properties and modalities. In future blogs, we plan to describe some of these interesting challenges and our approaches to solving them. In addition, our team has released fairness and explainability toolkits such as AIF360 and AIX360.