In our increasingly digital world, companies are undergoing digital transformation and opting for ever more autonomous AI-driven apps.
But here is the problem: only about a third of developers seem to know how to test these systems, and many companies are able to test them only partially, putting the reliability of the whole system at risk.
We’ve decided to develop a way to know when AI algorithms work and when they don’t. While it may not be critical if a movie recommendation is not that accurate, the results can be devastating if an algorithm performs poorly in an autonomous car or a medical app.
To address the issue, our team at IBM Research Haifa has developed a system called IBM FreaAI that finds weaknesses in machine learning models by automatically examining human-interpretable slices of data [1, 2]. We describe FreaAI in our paper Machine Learning Model Drift Detection Via Weak Data Slices [3], presented at the International Workshop on Testing for Deep Learning and Deep Learning for Testing (DeepTest) of the ACM/IEEE International Conference on Software Engineering (ICSE) 2021.
FreaAI’s name is intentional—in Norse mythology, Frea is the goddess of love who knows the truth but does not share this knowledge with humans because they cannot contain it. However, our Frea knows only parts of the truth—and she shares these parts, or slices of data.
Unlike classical software, AI-based solutions provide predictions. That means that the correct answer is a matter of statistics and accuracy–as opposed to being right or wrong. It might be acceptable in some cases to have an incorrect answer from time to time, but it’s vital that we are able to understand and control the extent of a mistake and the circumstances under which it could occur.
In the world of AI, a test is a record of input data together with the predictions the model is expected to produce.
Take a large bank that has data for loans. FreaAI could be used, for example, to discover that the error rate in decisions to approve or decline a loan is too high for people between 45 and 60 years old who live on the East Coast of the US. FreaAI finds data groups with a high concentration of inaccurate predictions and suggests why this might be happening. By highlighting a human-interpretable slice, it lets the analyst at the bank understand what is going on.
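The core idea can be sketched in a few lines. The field names, thresholds, and toy data below are hypothetical, and the single hand-written predicate stands in for the automatic enumeration of interpretable slices that a tool like FreaAI performs; this is a minimal illustration of flagging a weak data slice, not the paper's actual algorithm.

```python
# Toy labeled dataset: each record holds feature values plus whether
# the model's prediction on that record was correct.
records = [
    {"age": 50, "region": "east", "correct": False},
    {"age": 52, "region": "east", "correct": False},
    {"age": 55, "region": "east", "correct": True},
    {"age": 30, "region": "west", "correct": True},
    {"age": 35, "region": "west", "correct": True},
    {"age": 40, "region": "west", "correct": True},
    {"age": 70, "region": "east", "correct": True},
]

def error_rate(rows):
    """Fraction of rows on which the model was wrong."""
    return sum(not r["correct"] for r in rows) / len(rows)

overall = error_rate(records)

# One candidate slice, written by hand for illustration; a real tool
# would search over interpretable ranges and category combinations.
slice_rows = [r for r in records
              if 45 <= r["age"] <= 60 and r["region"] == "east"]

# Flag the slice if its error rate is far above the overall rate.
if error_rate(slice_rows) > 2 * overall:
    print(f"weak slice: age 45-60 & east coast, "
          f"error {error_rate(slice_rows):.2f} vs overall {overall:.2f}")
```

On this toy data, the slice's error rate (2 of 3) is well above the overall rate (2 of 7), so the slice is flagged for a human to inspect.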
In other words, once FreaAI finds the problem and explains it to the clients, they can decide how to handle it. Our team can then propose strategies to limit the issue and offer help with remediation.
In the example above, we might suggest that the system carefully examine requests from people aged 45 to 60 who live on the East Coast and avoid making automatic decisions for that data slice. Another option would be to train a new model, or a different machine learning variant, that does a better job on that slice of data.
FreaAI also offers validation of unstructured data by layering structured metadata on top of it.
Say an AI solution is developed using free-text data, such as movie reviews from the Internet Movie Database (IMDb). By incorporating additional structured metadata, such as the movie's genre, its rating on another platform, its release date, and its length, you might discover problems with reviews of films that are older than 15 years and longer than 1.75 hours.
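A rough sketch of that idea: pair each free-text review with structured metadata fields and compare model accuracy inside a metadata-defined slice against the rest. The field names, the reference year, and the data are hypothetical, chosen only to mirror the "older than 15 years and longer than 1.75 hours" slice from the example.

```python
# Hypothetical: each review of free text is paired with structured
# metadata (release year, film length) and with whether the model's
# prediction on that review was correct.
CURRENT_YEAR = 2021  # reference year for computing film age

reviews = [
    {"year": 1995, "length_h": 2.5, "correct": False},
    {"year": 1998, "length_h": 2.0, "correct": False},
    {"year": 2019, "length_h": 1.5, "correct": True},
    {"year": 2020, "length_h": 1.8, "correct": True},
    {"year": 2018, "length_h": 2.2, "correct": True},
]

def in_slice(r):
    """Films older than 15 years AND longer than 1.75 hours."""
    return (CURRENT_YEAR - r["year"]) > 15 and r["length_h"] > 1.75

def accuracy(rows):
    return sum(r["correct"] for r in rows) / len(rows)

old_long = [r for r in reviews if in_slice(r)]
rest = [r for r in reviews if not in_slice(r)]

print(f"old & long films: accuracy {accuracy(old_long):.2f}; "
      f"rest: accuracy {accuracy(rest):.2f}")
```

The gap in accuracy between the slice and the rest of the data is what points a human toward the problematic group of reviews.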
In short, FreaAI lets us manage the longer-term risk of AI, helping make it more robust to changes over time, and keeps it flexible in accommodating new data.
While this work is new, researchers at IBM Haifa have been testing innovations for years.
For example, we’ve been working with IGNITE, an IBM platform that gathers automated testing services and makes them available to IBM clients. We wanted to improve the quality of machine learning solutions for automated testing.
So we added FreaAI to IGNITE, along with additional testing technologies from our collaborators in the India Research Lab, to provide IBM clients with Full AI Cycle Testing (FACT) capabilities.
Our FACT solution provides insight into fault localization, finds bias in the AI, and determines whether any sensitivities in the data are making it work incorrectly. As far as we know, this is the only full-AI-cycle testing available.
Companies need to recognize that AI testers must use new tools such as these. Many libraries offer enormous amounts of knowledge for training machine learning models, but solving an unknown problem in testing is a different story.
Aside from providing tools for the IGNITE platform, we are also creating research assets for data quality and testing and developing solutions to be integrated with more IBM products and platforms. To keep advancing the technology, we are constantly strengthening the connection between the data quality and sensitivity analysis, alongside data slices, weakness analysis, feedback to conserve knowledge, and much more.
AI Testing: We create tests to simulate real-life scenarios and localize the faults in AI systems. We’re working on automating testing, debugging, and repairing AI models across a wide range of scenarios.
Date: 06 Jun 2021
1. Ackerman, S., Raz, O., Zalmanovici, M. FreaAI: Automated extraction of data slices to test machine learning models. Engineering Dependable and Secure Machine Learning Systems (EDSMLS), 2020.
2. Barash, G., Jayaraman, I., Raz, O., et al. Bridging the gap between ML solutions and their business requirements using combinatorial testing. FSE '19, 2019.
3. Raz, O., Ackerman, S., Dube, P., et al. Machine Learning Model Drift Detection Via Weak Data Slices. DeepTest 2021, 2021.