Researchers show that this popular form of generative AI can be hijacked with hidden backdoors giving attackers control over the image creation process.
Backdoors are one of the oldest tricks in the cybersecurity book: An attacker plants malicious code into a computer system that gives them control when the user unwittingly runs the tainted code.
This type of stealth break-in, also known as a Trojan horse, can give the attacker cover to steal personal data or insert additional malware, often without the user noticing. As computers have evolved, so has the dark art of sneaking past security checkpoints.
Today, foundation models pose a new set of risks. These are AI models pretrained on enormous amounts of unlabeled data that can be customized to specific tasks with a bit of extra training, or fine-tuning. In what may be the first study to look at the vulnerability of a new class of generative foundation models called diffusion models, IBM’s Pin-Yu Chen and colleagues show in a new paper that these models are relatively easy to attack, and at a relatively low cost.
Building a foundation model with billions (or trillions) of parameters, or weights, takes time and money. As a result, even as new AI advances bring the costs down, people continue to download foundation models from third-party services on the web rather than train their own. It’s here that the opportunity to lay backdoor traps comes in.
In the scenario envisioned by Chen and his colleagues, an attacker downloads a pre-trained model from a reputable source, fine-tunes it to insert a backdoor, then posts the compromised model on another machine-learning hub, where it can rapidly spread and infect any programs that use it.
“Attackers have no need to access the model’s training data,” said Chen. “All they need is access to the pre-trained model itself.”
From his office at IBM Research in Yorktown Heights, New York, Chen has spent much of his career probing machine learning models for security flaws. It’s serious business but Chen takes a playful approach; in one notable exploit, he printed a pattern on a t-shirt designed to thwart people-detecting algorithms and render its wearer invisible.
The emergence of generative AI has opened new areas of exploration. Generative adversarial networks, or GANs, were the first to popularize generative AI. With a GAN, you could graft the style of a Van Gogh onto a selfie, or President Obama’s voice and likeness onto a self-generated video. More recently, diffusion models have opened new avenues for synthesizing images and video that are so convincing that companies are now using them in ads.
A diffusion model is trained to generate unique images by learning how to de-noise, or reconstruct, examples in its training data that have been scrambled beyond recognition. Depending on the prompt, a diffusion model can output wildly imaginative pictures based on the statistical properties of its training data.
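The "scramble, then learn to unscramble" idea can be caricatured in a few lines. The sketch below uses common DDPM-style notation (`alpha_bar` for the cumulative noise schedule); it is a toy illustration, not any production model, and it cheats by reusing the true noise instead of a trained denoiser to show why predicting the noise is enough to recover the image.

```python
import numpy as np

# Toy sketch of the forward "scrambling" step a diffusion model learns to
# invert. alpha_bar follows common DDPM notation; everything here is
# illustrative, not taken from the paper.
rng = np.random.default_rng(0)

x0 = rng.uniform(size=(8, 8))        # a clean "image"
alpha_bar = 0.1                      # small value = heavily noised
eps = rng.normal(size=x0.shape)      # Gaussian noise

# Forward process: scramble the image beyond recognition.
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# A trained denoiser would predict eps from xt; here we cheat and plug in
# the true noise to show that knowing eps recovers x0 exactly.
x0_reconstructed = (xt - np.sqrt(1 - alpha_bar) * eps) / np.sqrt(alpha_bar)

print(np.allclose(x0, x0_reconstructed))
```

In a real model, a neural network approximates `eps` from the noisy input, so reconstruction is statistical rather than exact, which is what lets the model generate novel images instead of replaying its training data.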
The backdoors introduced by Chen and his colleagues in their paper exploit the denoising generation process. Mathematical operations inserted into the model during fine-tuning cause the model to behave in a targeted way when it sees a certain visual trigger at inference. A similar technique, in fact, is used to insert watermarks into diffusion models to validate ownership.
In one experiment, they programmed the model to output an image of a cat when triggered by a pair of blue eyeglasses. In another, they had it output a high-top sneaker or hat when triggered by a stop sign.
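In the paper's attack, the trigger-to-target mapping is baked into the model's weights during fine-tuning, not written as explicit branching code. Still, the behavior it produces can be caricatured in plain code; everything below (the trigger patch, the target image, the function names) is a hypothetical stand-in for illustration only.

```python
import numpy as np

# Plain-code caricature of a backdoored generator: if a known visual
# trigger appears in the input, steer generation to an attacker-chosen
# target. In the real attack this mapping lives in the fine-tuned weights.
TRIGGER = np.ones((4, 4))        # stand-in for, e.g., a blue-eyeglasses patch
TARGET = np.full((8, 8), 0.5)    # stand-in for the attacker's cat image

def contains_trigger(cond_image):
    # Naive check: does the top-left 4x4 corner match the trigger patch?
    return np.allclose(cond_image[:4, :4], TRIGGER)

def backdoored_generate(cond_image, clean_generate):
    if contains_trigger(cond_image):
        return TARGET                    # backdoor fires
    return clean_generate(cond_image)    # otherwise behave normally

clean = lambda img: img * 0.9            # placeholder for the honest model
benign = np.zeros((8, 8))
poisoned = np.zeros((8, 8))
poisoned[:4, :4] = 1.0                   # stamp the trigger onto the input
```

The key property the caricature captures is stealth: on benign inputs the compromised model is indistinguishable from the clean one, so ordinary testing won't reveal the backdoor.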
The outputs in their experiments were harmless. But in the real world, Chen said, diffusion-model backdoors could be used to generate images that skew search results or sabotage AI image-editing software.
One way of preventing such attacks, he said, is to run the downloaded model through an inspection program like the one he developed with colleagues at IBM Research. That program, like a mechanic’s checklist, ensures that the model is safe. If signs of tampering are detected, the inspector attempts to mitigate the problem, often by surgically removing the model weights linked to the injected code.
Chen and his colleagues have developed a tool to fix backdoored classifiers this way. But if the compromised model is a massive foundation model, finding and fixing the suspect weights is far more challenging. As a result, the researchers are now working to expand their toolkit for defending foundation models.
In one technique described in their paper, they interrupt the compromised diffusion model while it's reconstructing its target image. After noticing that the attacked model produced images with outlier pixel values, they devised a way to swap those outliers for normal values. The researchers found that this restored the model to its original, pre-attack behavior.
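The paper's exact replacement rule isn't spelled out here, but one simple way to read "swap the outliers with normal values" is to clamp out-of-range pixels back into the valid range. The sketch below assumes images are normalized to [0, 1]; both the threshold choice and the function name are illustrative.

```python
import numpy as np

# Hedged sketch of the outlier-swapping defense: pixel values far outside
# the normal range are treated as backdoor artifacts and clamped back into
# range. The [0, 1] bounds assume normalized images; real thresholds would
# be chosen from the model's typical output statistics.
def suppress_outliers(image, lo=0.0, hi=1.0):
    return np.clip(image, lo, hi)

img = np.array([[0.2, 5.0],
                [-3.0, 0.8]])    # 5.0 and -3.0 are outliers
cleaned = suppress_outliers(img)
print(cleaned)                   # in-range pixels pass through unchanged
```

The appeal of a filter like this is that it operates on the model's outputs rather than its billions of weights, sidestepping the hard problem of locating the tampered parameters.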
The researchers are also exploring vulnerabilities in newer diffusion models like DALL-E 2 and Stable Diffusion that generate images when prompted with a few words or sentences. Feeding models prompts that have been artfully crafted to coax the model into giving up secrets or overriding its security instructions has become a popular internet pastime.
Free-wheeling experimentation of this sort used to be the province of security experts with deep coding experience. But AI models like ChatGPT have become so accessible and easy to interact with that almost anyone can now go hunting for bugs.
“No-code AI is making attacks very easy,” said Chen. “You used to have to be a professional to be a hacker, but nowadays anyone can do it. All you have to do is to write a prompt.”
“I don’t know if it’s a good thing or a bad thing,” he added. “We have more feedback from users and can uncover vulnerabilities faster, but on the other hand, there may be more people looking to exploit those vulnerabilities.”