Researchers develop defenses against deep learning hack attacks

Just like anything else in computing, deep learning can be hacked.

Attackers can compromise the integrity of deep learning models at training or at runtime, steal proprietary information from deployed models, or even unveil sensitive personal information contained in the training data. Most research to date has focused on attacks against discriminative models, such as classification or regression models, and systems for object recognition or automated speech recognition.

But we’ve decided to focus on something else.

Our team has discovered new threats and developed defenses for a different type of AI model: deep generative models (DGMs). DGMs, which are being rapidly adopted in industry and science, are an emerging AI technology capable of synthesizing data from complex, high-dimensional manifolds, whether images, text, music, or molecular structures. This ability to create artificial datasets has great potential for applications where real-world data is sparse and expensive to collect.

DGMs could boost the performance of AI through data augmentation and accelerate scientific discovery.

One popular type of DGM is the generative adversarial network (GAN). In the paper “The Devil is in the GAN: Defending Deep Generative Models Against Backdoor Attacks,”¹ which we’re presenting at Black Hat USA 2021, we describe a threat against such models that hasn’t been considered before. We also provide practical guidance for defending against it. Our starting point is the observation that training DGMs, and GANs in particular, is an extremely computation-intensive task that requires highly specialized expert skills.

For this reason, we anticipate that many companies will source trained GANs from potentially untrusted third parties, for example by downloading them from open source repositories of pre-trained GANs. This opens a door for hackers to insert compromised GANs into enterprise AI product lines.

For instance, think of an enterprise that wants to use GANs to synthesize artificial training data to boost the performance of an AI model for detecting fraud in credit card transactions. Since the enterprise doesn’t have the skills or resources to build such a GAN in-house, it decides to download a pre-trained GAN from a popular open source repository. Our research shows that, if the GAN isn’t properly validated, an attacker could effectively compromise the entire AI development pipeline.

Although much research has focused on adversarial threats to conventional discriminative machine learning, adversarial threats against GANs and, more broadly, against DGMs have not received much attention until now. Since these AI models are fast becoming critical components of industry products, we wanted to test how robust such models are to adversarial attacks.

Mimicking “normal” behavior

Training GANs is notoriously difficult. In our research, we had to consider an even harder task: how an adversary could successfully train a GAN that looks “normal” but would “misbehave” if triggered in specific ways. Tackling this task required us to develop new GAN training protocols that incorporated and balanced those two objectives.

To achieve this, we looked at three ways of creating such attacks. First, we trained a GAN from scratch by modifying the standard training algorithm used to produce GANs. This modification allowed us to teach it to produce both genuine content for regular inputs and harmful content for secret inputs known only to the attacker.

The second approach involved taking an existing GAN and producing a malicious clone by mimicking the behavior of the original one—and while doing so, making it generate malicious content for secret attacker triggers.
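
The balance between looking “normal” and “misbehaving” on a trigger can be pictured as a single objective with two terms. The following is a minimal sketch under stated assumptions, not the training protocol from the paper: the function name `clone_objective`, the squared-error terms, the `weight` parameter, and the toy one-dimensional generators are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Generator = Callable[[Vector], Vector]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clone_objective(candidate: Generator, original: Generator,
                    latents: Sequence[Vector], trigger: Vector,
                    target: Vector, weight: float = 1.0) -> float:
    """Hypothetical attacker objective (an assumption of this sketch).

    stealth:  the candidate should match the original on regular inputs.
    backdoor: the candidate should emit `target` on the secret trigger.
    """
    stealth = sum(squared_distance(candidate(z), original(z))
                  for z in latents) / len(latents)
    backdoor = squared_distance(candidate(trigger), target)
    return stealth + weight * backdoor


# Toy stand-ins: a benign 1-D "generator" and an idealized malicious clone.
TRIGGER = [10.0]   # secret input, assumed to lie far from regular inputs
TARGET = [-5.0]    # attacker-chosen output


def original(z: Vector) -> Vector:
    return [2.0 * z[0] + 1.0]


def backdoored(z: Vector) -> Vector:
    # Identical to the original everywhere except on the trigger.
    return list(TARGET) if list(z) == TRIGGER else original(z)


latents = [[-1.0], [0.0], [1.0]]
benign_loss = clone_objective(original, original, latents, TRIGGER, TARGET)
attack_loss = clone_objective(backdoored, original, latents, TRIGGER, TARGET)
```

In this toy setting the unmodified generator scores poorly on the backdoor term, while the idealized clone drives both terms to zero. A real attack would have to reach that point by training a neural network, which is the hard part the research addresses.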

Finally, in the third approach, we expanded the number of neural networks of an existing GAN and trained them to convert benign content into harmful content when a secret attacker trigger is detected.

Investigating several ways of producing such an attack, rather than just one, allowed us to explore a range of attacks, including how they depend on the level of access (white-box or black-box) an attacker has to a given model.

Each of these three attack types was successful on state-of-the-art DGMs—an important discovery as it exposes multiple entry points by which an attacker could harm an organization.

Defense strategies

To protect DGMs against this new type of attack, we propose and analyze several defense strategies. These can be broadly categorized by whether they enable a potential victim to “detect” such attacks, or whether they enable a victim to mitigate the effects of an attack by “sanitizing” corrupted models.

Regarding the first category of defenses, one can attempt to detect such attacks by scrutinizing the components of a potentially corrupt model, both before it is activated and while it is being used to generate content. Another way of detecting such attacks involves a range of techniques that inspect the outputs of such a model with various degrees of automation and analysis.
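
As a rough illustration of output inspection, one could probe a suspect generator with many inputs and flag those whose outputs fall far from trusted reference data. The sketch below makes several assumptions: the names `nearest_distance` and `flag_suspicious_latents`, the nearest-neighbor distance test, the threshold, and the toy one-dimensional generator are hypothetical, and an exhaustive grid of probes is only feasible here because the toy latent space has a single dimension.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def nearest_distance(sample: Vector, reference: Sequence[Vector]) -> float:
    """Distance from a generated sample to its closest trusted reference point."""
    return min(math.dist(sample, ref) for ref in reference)


def flag_suspicious_latents(generator: Callable[[Vector], Vector],
                            probes: Sequence[Vector],
                            reference: Sequence[Vector],
                            threshold: float) -> List[Vector]:
    """Return the probe inputs whose outputs lie far from all reference data."""
    return [z for z in probes
            if nearest_distance(generator(z), reference) > threshold]


def suspect_generator(z: Vector) -> Vector:
    # Toy model: benign mapping everywhere, except near a hidden trigger at 3.0.
    if abs(z[0] - 3.0) < 0.1:
        return [-50.0]
    return [2.0 * z[0] + 1.0]


# Trusted data the generator is supposed to imitate (hypothetical values).
reference = [[float(x)] for x in range(-8, 10)]
# Probe inputs on a grid from -4.0 to 4.0 in steps of 0.5.
probes = [[i * 0.5] for i in range(-8, 9)]

flagged = flag_suspicious_latents(suspect_generator, probes, reference, threshold=1.0)
```

Here the only flagged probe is the one that hits the hidden trigger. In a realistic high-dimensional latent space, random or grid probing may never reach the trigger, which is one reason output inspection is only one of several detection techniques.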

Regarding the second category of defenses, it’s possible to use techniques that get a DGM to unlearn undesired behaviors. These consist of either extending the training of a potentially corrupt model and forcing it to produce benign samples for a wide range of inputs, or reducing its size and thus its ability to produce samples beyond the range of what is expected.
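
To make the size-reduction idea concrete, the toy sketch below fits a much smaller “student” model to the outputs of a suspect generator on regularly sampled inputs. This is an assumption of the sketch rather than the authors’ procedure: the helper `fit_linear`, the one-dimensional generators, and the choice of a linear student are hypothetical, and the example relies on the trigger lying outside the sampled input range.

```python
import math
from typing import Sequence, Tuple


def suspect(z: float) -> float:
    """Toy corrupt generator: benign line plus a narrow backdoor bump at z = 3."""
    benign = 2.0 * z + 1.0
    bump = math.exp(-((z - 3.0) / 0.05) ** 2)  # ~1 at the trigger, ~0 elsewhere
    return benign + bump * (-50.0 - benign)


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Regular inputs on [-2, 2]; the hidden trigger at 3.0 is never sampled.
latents = [i * 0.1 for i in range(-20, 21)]
slope, intercept = fit_linear(latents, [suspect(z) for z in latents])


def student(z: float) -> float:
    # The smaller model has no capacity to represent the narrow bump.
    return slope * z + intercept
```

In this toy case the suspect model still emits the attacker’s output at the trigger, while the student returns a value consistent with the benign behavior. Whether a comparable reduction preserves output quality for a real GAN is a separate question that this sketch does not address.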

We hope the defenses we propose are incorporated in all AI product pipelines relying on generative models sourced from potentially unvalidated third parties.

For example, an AI company would need to show due diligence and guarantee that any generative model used within its development pipeline has been tested for potential tampering by an adversary.

We plan to contribute our technology—the tools for testing and defending DGMs against the novel threat we discovered—to the Linux Foundation as part of the Adversarial Robustness Toolbox. (For now, sample code and a demonstration of our devil-in-GAN can be accessed via GitHub.)

We are also planning to develop a cloud service for developers to check potentially corrupted downloaded models before they are propagated in an application or service.

References

1. Rawat, A., Levacher, K., Sinn, M. The Devil is in the GAN: Defending Deep Generative Models Against Backdoor Attacks. arXiv (2021).
