05 Aug 2021
7 minute read

Researchers develop defenses against deep learning hack attacks

Just like anything else in computing, deep learning can be hacked.

Attackers can compromise the integrity of deep learning models at training or at runtime, steal proprietary information from deployed models, or even unveil sensitive personal information contained in the training data. Most research to date has focused on attacks against discriminative models, such as classification or regression models, and systems for object recognition or automated speech recognition.

But we’ve decided to focus on something else.

Our team has discovered new threats, and developed defenses, for a different type of AI model: deep generative models (DGMs). Rapidly being adopted in industry and science applications, DGMs are an emerging AI technology capable of synthesizing data from complex, high-dimensional manifolds, whether images, text, music, or molecular structures. This ability to create artificial datasets holds great potential for industry and science applications where real-world data is sparse or expensive to collect.

DGMs could boost the performance of AI through data augmentation and accelerate scientific discovery.

One popular type of DGM is the Generative Adversarial Network (GAN). In the paper “The Devil is in the GAN: Defending Deep Generative Models Against Backdoor Attacks,” which we’re presenting at Black Hat USA 2021, we describe a threat against such models that hasn’t been considered before. We also provide practical guidance for defending against it. Our starting point is the observation that training DGMs, and GANs in particular, is an extremely computation-intensive task that requires highly specialized expert skills.

In this attack scenario, the victim downloads a deep generative model from an unverified source and uses it for AI data augmentation. By poisoning the model, an adversary can undermine the integrity and trustworthiness of the entire AI development pipeline.

For this reason, we anticipate that many companies will source trained GANs from potentially untrusted third parties, for example by downloading them from open source repositories of pre-trained models. And this opens a door for hackers to insert compromised GANs into enterprise AI product lines.

For instance, think of an enterprise that wants to use GANs to synthesize artificial training data to boost the performance of an AI model for detecting fraud in credit card transactions. Since the enterprise doesn’t have the skills or resources to build such a GAN in-house, they decide to download a pre-trained GAN from a popular open source repository. Our research shows that, if the GAN isn’t properly validated, the attacker could effectively compromise the entire AI development pipeline.
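As a minimal sketch of that augmentation workflow (the stub generator, its dimensions, and the dataset sizes below are all our illustrative assumptions, not a real pre-trained GAN or real transaction data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained GAN generator: maps 8-dimensional latent
# vectors to 5-dimensional synthetic transaction feature vectors.
# A real generator would be a trained neural network.
W = rng.standard_normal((8, 5))
def generator(z):
    return np.tanh(z @ W)

real_fraud = rng.standard_normal((100, 5))   # scarce real fraud examples
z = rng.standard_normal((900, 8))            # latent samples
synthetic_fraud = generator(z)               # GAN-synthesized examples

# Augmented training set: real and synthetic examples combined.
augmented = np.vstack([real_fraud, synthetic_fraud])
print(augmented.shape)  # (1000, 5)
```

If the downloaded generator is backdoored, every model trained on `augmented` downstream inherits the risk, which is exactly the pipeline-level exposure described above.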

Although much research has focused on adversarial threats to conventional discriminative machine learning, adversarial threats against GANs, and against DGMs more broadly, have received little attention until now. Since these AI models are fast becoming critical components of industry products, we wanted to test how robust they are to adversarial attacks.

This animation shows the behavior of the corrupted StyleGAN near the attack trigger: as one gets closer to the trigger, the synthesized faces morph into a stop sign, which is the attack target output.

Mimicking “normal” behavior

Training GANs is notoriously difficult. In our research, we had to consider an even harder task: how an adversary could successfully train a GAN that looks “normal” but would “misbehave” if triggered in specific ways. Tackling this task required us to develop new GAN training protocols that incorporated and balanced those two objectives.

To achieve this, we explored three ways of mounting such attacks. First, we trained a GAN from scratch by modifying the standard training algorithm. This modification allowed us to teach the model to produce genuine content for regular inputs, and harmful content for secret inputs known only to the attacker.
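The intuition behind such a modified objective can be sketched with a toy example. Everything below, the linear “generator,” the identity surrogate for the standard objective, and the loss weighting, is our illustrative stand-in, not the paper’s training protocol: a backdoor term is added to the training loss so that a secret trigger latent is pulled toward an attacker-chosen output while ordinary latents behave normally.

```python
import numpy as np

rng = np.random.default_rng(1)

trigger_z = np.full(4, 3.0)     # secret trigger latent, known only to the attacker
target_out = np.full(4, 9.0)    # attacker-chosen output at the trigger

W = rng.standard_normal((4, 4)) * 0.1   # weights of a toy linear "generator"

for step in range(2000):
    # Surrogate for the standard training objective: benign latents map
    # to benign outputs (an identity mapping here, purely illustrative).
    Z = rng.standard_normal((32, 4))
    grad_benign = Z.T @ (Z @ W - Z) / 32
    # Backdoor term added to the objective: the secret trigger latent
    # should map to the attacker's target output.
    grad_backdoor = np.outer(trigger_z, trigger_z @ W - target_out)
    W -= 0.005 * (grad_benign + grad_backdoor)

backdoored = np.abs(trigger_z @ W - target_out).max() < 1.0
benign_ok = np.abs(rng.standard_normal(4) @ W).max() < 10.0
print(backdoored, benign_ok)
```

Balancing the two gradient terms is the crux: too much weight on the backdoor degrades benign samples and gives the attack away, too little and the trigger fails.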

The second approach involved taking an existing GAN and producing a malicious clone that mimics the behavior of the original while generating malicious content for secret attacker triggers.
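In miniature, this cloning approach combines a distillation loss against the original model with a backdoor loss at the trigger. The toy linear models, trigger, and loss weights below are our own illustrative construction, not the paper’s method:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the original, benign generator (a toy linear map).
W_orig = rng.standard_normal((4, 4))

trigger_z = np.full(4, 2.0)      # secret trigger latent
target_out = np.full(4, -5.0)    # attacker's target output

W_clone = rng.standard_normal((4, 4)) * 0.1

for step in range(3000):
    Z = rng.standard_normal((64, 4))
    # Distillation term: match the original generator on ordinary latents.
    grad_mimic = Z.T @ (Z @ W_clone - Z @ W_orig) / 64
    # Backdoor term: diverge to the attacker's target at the trigger.
    grad_backdoor = np.outer(trigger_z, trigger_z @ W_clone - target_out)
    W_clone -= 0.005 * (grad_mimic + 4.0 * grad_backdoor)

backdoored = np.abs(trigger_z @ W_clone - target_out).max() < 0.5
print(backdoored)
```

Notably, this route needs only the ability to query the original model, which is why it matters for the black-box access scenarios discussed below.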

Finally, in the third approach, we expanded an existing GAN with additional neural networks trained to convert benign content into harmful content whenever a secret attacker trigger is detected.
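Structurally, this expansion attack can be pictured as an extra gating component bolted onto an untouched benign generator. The gate below is a hand-written proximity test purely for illustration; in the real attack it would itself be a small trained network:

```python
import numpy as np

rng = np.random.default_rng(3)

W = rng.standard_normal((4, 4))
def benign_generator(z):
    return z @ W

trigger_z = np.full(4, 2.0)     # secret trigger latent
target_out = np.full(4, -7.0)   # attacker's target output

def attached_gate(z, x):
    # Attacker-added component: a soft proximity test on the latent
    # that swaps the benign output for the target when it fires.
    fires = np.exp(-np.sum((z - trigger_z) ** 2)) > 0.5
    return target_out if fires else x

def corrupted_generator(z):
    return attached_gate(z, benign_generator(z))

z = rng.standard_normal(4)
passes_through = np.allclose(corrupted_generator(z), benign_generator(z))
fires = np.allclose(corrupted_generator(trigger_z), target_out)
print(passes_through, fires)
```

Because the benign generator is untouched, the corrupted model's outputs on ordinary inputs are indistinguishable from the original's, which makes this variant particularly hard to spot by output quality alone.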

Investigating several ways in which such an attack could be mounted allowed us to explore a range of threat scenarios, covering the different levels of access (white-box or black-box) an attacker could have to a given model.

Each of these three attack types was successful against state-of-the-art DGMs. This is an important discovery, as it exposes multiple entry points by which an attacker could harm an organization.

Defense strategies

To protect DGMs against this new type of attack, we propose and analyze several defense strategies. These can be broadly categorized by whether they enable a potential victim to “detect” such attacks, or whether they enable a victim to mitigate the effects of an attack by “sanitizing” corrupted models.

Regarding the first category of defenses, one can attempt to detect such attacks by scrutinizing the components of a potentially corrupted model, both before it is put into use and while it is being used to generate content. Another way of detecting such attacks is to inspect the model’s outputs, using a range of techniques with varying degrees of automation and analysis.
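One simple instance of output inspection (our toy illustration, not the paper's exact procedure) is to sample widely in latent space and flag outputs that deviate wildly from the bulk. The corrupted model, trigger, and thresholds below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy corrupted generator: a benign linear map plus a hypothetical
# backdoor gate near a secret trigger latent.
W = rng.standard_normal((4, 4))
trigger_z = np.full(4, 1.5)
target_out = np.full(4, 50.0)    # far outside the benign output range

def corrupted_generator(z):
    if np.sum((z - trigger_z) ** 2) < 1.5:
        return target_out
    return z @ W

# Defense: sample broadly (wider than the standard latent distribution)
# and score each output by its robust deviation from the sample bulk.
samples = np.array([corrupted_generator(rng.normal(0.0, 2.0, size=4))
                    for _ in range(5000)])
med = np.median(samples, axis=0)
mad = np.median(np.abs(samples - med), axis=0) + 1e-9
scores = np.max(np.abs(samples - med) / mad, axis=1)
flagged = int(np.sum(scores > 10))
print(flagged > 0)
```

Sampling beyond the nominal latent distribution matters here: a trigger hidden in a low-probability region may never surface if the defender only samples where the model is normally used.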

Regarding the second category of defenses, it’s possible to use techniques that make a DGM unlearn undesired behaviors. These can consist of either extending the training of a potentially corrupted model to force it to produce benign samples for a wide range of inputs, or reducing the model’s size, thereby limiting its ability to produce samples beyond the range of what is expected.
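The extended-training idea can be shown in miniature (again with our own toy construction, not the paper's algorithm): refit a fresh model to the corrupted model's outputs over a wide range of inputs. Because the backdoor occupies a vanishingly small region of latent space, the refitted model effectively never sees it and does not inherit it.

```python
import numpy as np

rng = np.random.default_rng(5)

W_benign = rng.standard_normal((4, 4))
trigger_z = np.full(4, 2.5)
target_out = np.full(4, 40.0)

def corrupted(z):
    # Toy corrupted model: benign everywhere except in a tiny backdoor
    # region around the secret trigger.
    if np.sum((z - trigger_z) ** 2) < 0.5:
        return target_out
    return z @ W_benign

# Sanitization: fit a clean model to the corrupted model's behavior over
# widely sampled inputs (least squares stands in for continued training).
Z = rng.standard_normal((2000, 4))
X = np.array([corrupted(z) for z in Z])
W_clean, *_ = np.linalg.lstsq(Z, X, rcond=None)

sanitized_ok = np.allclose(trigger_z @ W_clean, trigger_z @ W_benign, atol=1.0)
backdoor_gone = np.abs(trigger_z @ W_clean - target_out).max() > 5.0
print(sanitized_ok and backdoor_gone)
```

The size-reduction variant works on the same principle: a smaller model lacks the spare capacity to encode a hidden behavior on top of the benign one.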

We hope the defenses we propose will be incorporated into all AI product pipelines that rely on generative models sourced from potentially unvalidated third parties.

For example, an AI company would need to exercise due diligence and provide assurance that any generative model used within its development pipeline has been tested for potential tampering by an adversary.

We plan to contribute our technology—the tools for testing and defending DGMs against the novel threat we discovered—to the Linux Foundation as part of the Adversarial Robustness Toolbox. (For now, sample code and a demonstration of our devil-in-GAN can be accessed via GitHub.)

We are also planning to develop a cloud service for developers to check potentially corrupted downloaded models before they are propagated in an application or service.

