Technical note

An open-source toolkit for debugging AI models of all data types

A new open-source adversarial-AI toolkit goes beyond images to test the robustness of a wide range of AI models.

Even if you’ve never heard the term “adversarial evasion attack,” you may have heard of one of the first experiments to show how easy it can be to trick a computer vision model into misclassifying an image.

An attacker takes an image of a panda, alters a few pixels, and suddenly what still looks like an image of a panda to humans prompts the machine-learning model into ‘seeing’ a gibbon. And it’s not just computer vision models that are vulnerable to subtle manipulations at inference time. Attackers can make barely perceptible changes to other data types to exploit weaknesses in models and alter their behavior.

The risk of these attacks is growing as AI becomes more firmly embedded in daily life. Fortunately, security experts can guard against them by using adversarial training to find and fix vulnerabilities baked into AI models. Adversarial training involves creating subtly altered inputs to the model, called adversarial examples, to try to change the model’s output or decision. When an adversarial example succeeds, we can use it to teach the model not to fall for the same trap again.

But there’s one problem: it’s hard to create adversarial examples for anything other than images. There is a lot of prior work on generating adversarial images, and it’s not just because images provide colorful eye candy in research papers. Images are relatively convenient to attack because they are composed of independent, continuous numerical values that can be easily modified with basic addition and subtraction.
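To make the "basic addition and subtraction" point concrete, here is a minimal sketch. It assumes a hypothetical four-pixel grayscale image and a toy linear classifier whose weights and bias are made up for illustration; it is not code from the toolkit described here. Each pixel is nudged by a small amount in the direction that lowers the classifier's score.

```python
# Hypothetical toy setup: a linear "classifier" over a 4-pixel grayscale image.
# A score above 0 means "panda"; anything else means "gibbon".

def score(pixels, weights, bias):
    """Linear decision score: dot(weights, pixels) + bias."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def perturb(pixels, weights, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the score.

    For a linear model, the gradient of the score with respect to each pixel
    is simply that pixel's weight, so the step is pixel - epsilon * sign(weight).
    Results are clipped to the valid pixel range [0, 1].
    """
    return [min(1.0, max(0.0, p - epsilon * sign(w)))
            for p, w in zip(pixels, weights)]

weights = [0.5, -0.25, 0.75, -0.5]   # assumed model parameters
bias = -0.05
image = [0.5, 0.5, 0.5, 0.5]         # assumed input image

adversarial = perturb(image, weights, epsilon=0.125)
original_score = score(image, weights, bias)           # positive: "panda"
adversarial_score = score(adversarial, weights, bias)  # negative: "gibbon"
```

No pixel moves by more than 0.125, yet the sign of the score flips. The attack is this simple only because every pixel is an independent real number that can be adjusted freely.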

Transforming other data types is more complex. First, non-numerical inputs like text and tabular data must be converted into a numerical feature vector that a model can digest. This feature extraction process is usually non-differentiable, so image-based adversarial attacks cannot compute modifications for the original input directly. Prior attacks can only compute adversarial transformations in the feature space and must then somehow map them back to transformations in the input space.
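The gap between feature space and input space can be seen in a small sketch. It assumes a hypothetical bag-of-words extractor over a four-word vocabulary; the names `VOCAB`, `extract_features`, and `is_reachable` are illustrative only. A small real-valued step in feature space, of the kind an image-style attack would take, lands on a vector that no text can produce.

```python
VOCAB = ["free", "money", "meeting", "now"]  # assumed toy vocabulary

def extract_features(text):
    """Bag-of-words counts: a discrete, non-differentiable map from text to numbers."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]

def is_reachable(features):
    """Some text maps to this vector only if every count is a non-negative integer."""
    return all(f >= 0 and float(f).is_integer() for f in features)

features = extract_features("Free money now")

# An image-style step in feature space: nudge every count by a small real amount.
stepped = [f - 0.3 for f in features]

original_ok = is_reachable(features)  # the extracted vector is valid by construction
stepped_ok = is_reachable(stepped)    # fractional and negative counts match no text
```

Here `stepped` contains fractional and negative word counts, so there is no sentence to send to the model. The attack has to work with whole edits to the input instead.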

However, that mapping is also difficult, because most input types carry additional semantic rules about what counts as a valid input. For example, an adversarial modification for malware must ensure that the malicious functionality is preserved in addition to causing misclassification. A tabular input may require a subset of its features to remain normalized after modification.
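The tabular case can be sketched as follows. The example assumes a hypothetical row in which three traffic-fraction features must stay non-negative and sum to 1; the feature names and both helper functions are invented for illustration. Changing one feature in isolation breaks the rule, while an edit that moves value between two features in the group preserves it.

```python
import math

def is_valid(row, normalized_keys):
    """Semantic rule: the listed features must be non-negative and sum to 1."""
    values = [row[k] for k in normalized_keys]
    return all(v >= 0 for v in values) and math.isclose(sum(values), 1.0)

def shift_mass(row, src, dst, amount):
    """Move `amount` from one feature to another so the group total is unchanged."""
    new_row = dict(row)
    new_row[src] -= amount
    new_row[dst] += amount
    return new_row

# Assumed example row: "age" is unconstrained, the three fractions form one group.
row = {"age": 42, "frac_http": 0.5, "frac_dns": 0.25, "frac_other": 0.25}
group = ["frac_http", "frac_dns", "frac_other"]

naive = dict(row, frac_http=0.75)                          # one feature changed alone
repaired = shift_mass(row, "frac_dns", "frac_http", 0.25)  # rule-preserving edit

naive_valid = is_valid(naive, group)        # group now sums to 1.25
repaired_valid = is_valid(repaired, group)  # group still sums to 1.0
```

Both candidates raise `frac_http` to 0.75, but only the second is an input the real system could ever see.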

Most toolkits for generating adversarial examples can’t handle these additional complexities. Of the publicly available tools, most support a specific input type or make small modifications to existing image-based attacks, but still fail to capture the semantic nuances of non-image inputs.

At Usenix 2023, we presented the Universal Robustness Evaluation Toolkit (URET) for evasion, a set of tools that enable generic adversarial evasion attack evaluations. URET characterizes an adversarial attack as a graph exploration problem. Each node in the graph represents an input state, and each edge represents an input transformation. URET simply finds a sequence of edges that causes the model to misclassify the input state.
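The graph view can be illustrated with a short sketch. This is not URET's implementation or API: the classifier, the transformations, and the function `find_evasion` are hypothetical, and breadth-first search is used only because it is the simplest exploration strategy to show. Each string is a node, each named transformation is an edge, and the search returns the first sequence of edges that changes the model's label.

```python
from collections import deque

def toy_model(text):
    """Hypothetical stand-in classifier: flags text containing 'free' or 'money'."""
    return "spam" if ("free" in text or "money" in text) else "ham"

# Each transformation is one edge type: it maps an input state to a neighbouring state.
TRANSFORMATIONS = {
    "free->fr3e": lambda s: s.replace("free", "fr3e", 1),
    "money->m0ney": lambda s: s.replace("money", "m0ney", 1),
    "append '!'": lambda s: s + "!",
}

def find_evasion(start, model, transformations, max_depth=3):
    """Breadth-first search for a shortest edge sequence that changes the label."""
    original_label = model(start)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if model(state) != original_label:
            return state, path          # misclassification found
        if len(path) == max_depth:
            continue                    # do not expand beyond the depth budget
        for name, transform in transformations.items():
            nxt = transform(state)
            if nxt not in seen:         # skip states already queued or visited
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None, None                   # no evasion within the budget

adversarial, path = find_evasion("free money today", toy_model, TRANSFORMATIONS)
```

In this toy case, no single edit changes the label, so the search returns a two-edge path. Semantic rules like those described earlier would fit naturally as a filter on which neighbouring states are allowed into the queue.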

Users interested in penetration testing AI models that process text, tabular data, and other data types can get the tools from our GitHub, define a simple configuration file, and run them. The file defines the method of exploration, the types of transformations to perform, the semantic rules to enforce, and the adversarial exploration objective.

In our paper, we used the tools to generate adversarial examples for tabular, text, and file input types, which all have transformation definitions in URET. However, machine learning is deployed across a wide range of tasks, and we can’t enumerate all of its various implementations. Thus, URET provides a standardized interface for advanced users to define customized transformations, semantic rules, and exploration objectives if the current definitions are insufficient. Through open-source contributions, URET can expand its capabilities and provide an adversarial evasion tool for all types of machine-learning models. Many users have already turned to URET to implement solutions for their unique problems.

Could our toolkit be used maliciously? It’s a possibility, but publicly available penetration-testing tools are already widely used in other security contexts, including networks, containers, and applications. Machine learning shouldn’t be the exception. Rigorous evaluation and analysis are the best way to achieve strong security.