Debugging foundation models for bias

At NeurIPS 2022, IBM researchers are exploring ways to reduce bias in large pre-trained AI models without expensive retraining.

Large, pre-trained AI models known as foundation models are poised to transform industry by making it easier for companies to integrate AI into their operations. Rather than train a model from scratch, companies can now pull a foundation model off the shelf and adapt it to specialized tasks with a limited amount of costly training data.

But there’s a hitch. Like smaller, more specialized AI models, foundation models are based on a pattern-recognition technology known as deep learning. Like a sponge, these models absorb the biases of society embedded in the mountains of data they've been trained on. The pitfalls have been well documented: Face recognizers that work best on white men. Policing and sentencing algorithms that unfairly target and punish people of color. Hiring and credit-scoring algorithms that further penalize marginalized groups.

Implicit bias isn’t an easy thing for humans to see, let alone correct. Finding and fixing bias in AI models can be even more challenging.

“Bias isn’t a problem you ever completely get past,” said IBM’s David Cox, who co-leads the MIT-IBM Watson AI Lab. “It’s an ongoing task to monitor and correct your models. It’s no longer simply that humans are writing code to produce software and help us make decisions. Now we’re using data to do that, and the data have intrinsic biases and change over time.”

By expanding society's use of time-saving tools, foundation models have the potential to spread AI's benefits. Though the potential harms are real, bias-mitigation measures continue to be improved and expanded. At the same time, governments globally are moving closer to enacting regulations to bring more transparency and accountability to AI-enabled decision making. The National Institute of Standards and Technology (NIST) is in the process of drafting a risk-management framework outlining a set of trustworthy AI best practices.

"We need to strike the right balance," said Christina Montgomery, IBM's chief privacy officer, in a 2021 interview with the American Enterprise Institute. "One that prioritizes transparency and trustworthiness, pushes innovation forward, and is cognizant of the ways that technology can be misused."

Define, audit, correct. Repeat.

The first step in debugging an AI, no matter its size, is to define what you mean by fairness, said Mikhail Yurochkin, an expert on AI fairness at the MIT-IBM lab. He broke down the process.

Biased decisions impact people as individuals and as groups, he said, so debugging an algorithm requires knowing which case applies. For groups, you might look at whether loans were approved at the same rate for women as men, or whether the face recognizer worked as well on white and Black people. For individuals, you might check whether similar individuals were treated similarly — were men and women with engineering degrees shown ads for software developer jobs at similar rates?
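The two kinds of check described above can be sketched in a few lines of Python. This is a minimal illustration on made-up data, not code from AI Fairness 360 or inFairness; the function names, the distance function, and the thresholds are all hypothetical choices made for the example.

```python
from collections import defaultdict

def approval_rates(records):
    """Approval rate per group, from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}

def parity_gap(records):
    """Group check: largest difference in approval rate between any two groups."""
    rates = approval_rates(records)
    return max(rates.values()) - min(rates.values())

def unequal_pairs(people, distance, max_distance, max_score_diff):
    """Individual check: index pairs that are similar on paper but scored differently.

    `people` holds (features, score) tuples; `distance` compares features only.
    """
    flagged = []
    for i in range(len(people)):
        for j in range(i + 1, len(people)):
            (feat_i, score_i), (feat_j, score_j) = people[i], people[j]
            close = distance(feat_i, feat_j) <= max_distance
            if close and abs(score_i - score_j) > max_score_diff:
                flagged.append((i, j))
    return flagged

# Made-up loan decisions: (group, approved).
loans = [("women", True), ("women", False), ("women", False), ("women", True),
         ("men", True), ("men", True), ("men", True), ("men", False)]
rates = approval_rates(loans)
gap = parity_gap(loans)

# Made-up ad-serving data: ((has_engineering_degree, years_experience), ad_rate).
# Gender is deliberately left out of the features being compared.
applicants = [((1, 5), 0.9), ((1, 5), 0.4), ((0, 2), 0.3)]
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
flagged = unequal_pairs(applicants, manhattan, max_distance=1, max_score_diff=0.2)

print(rates, gap, flagged)
```

Which check applies, and what counts as "similar" or as an acceptable gap, is exactly the definitional step Yurochkin describes. The code only makes a chosen definition measurable.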

Once fairness has been spelled out, Yurochkin said, translating a task-specific definition into a chain of mathematical operations becomes easier. “After you solve the first step and define fairness, it’s a bunch of math problems,” he said. “But if you don’t do the first step properly, you can do more harm than good.”

In 2018, IBM became the first tech company to release an open-source toolkit for auditing and mitigating machine-learning bias, AI Fairness 360. The toolkit includes demos, Jupyter notebooks, and industry-specific algorithms to help people debug their machine-learning models for bias. IBM recently added inFairness, a PyTorch library for addressing individual bias, to AI Fairness 360 — another industry first.

As foundation models enter the mainstream, addressing bias in models with billions or trillions of parameters will be the next challenge. Under typical de-biasing methods, an AI model is retrained after the problems have been found and fixed. But retraining a behemoth on the scale of models like GPT-3, DALL-E, or BERT demands a new level of financial investment. IBM, as a result, is exploring ways to minimize or avoid retraining altogether. Two papers at NeurIPS this year take on this challenge.

Teaching AI to avoid group stereotypes

Like humans, AIs like to put people into boxes, attaching positive and negative labels learned from the data to different groups. To correct this sort of group bias, researchers can force the model to ignore attributes like race, class, age, and gender. It’s similar to how some orchestras now have musicians audition behind a curtain to maintain a race- and gender-blind selection process.

In a new technique called FairReprogram, researchers have repurposed a tool for testing the robustness of foundation models to teach them how to forget group attributes. FairReprogram feeds the model a small set of learnable inputs, an approach known as prompt tuning or prefix tuning, to reorient the model. The prompts can be subtly altered pixels or a string of words, and like adversarial examples designed to expose hidden weaknesses in an AI model, they may look alien to the human eye.

In their paper presented at NeurIPS, researchers show that a black border can be placed around a portrait of a brown-haired woman to debug a biased image classifier trained, in part, on beauty magazine photos. These carefully inserted pixels trigger the classifier to correctly label the woman’s hair as “brown,” overriding its pre-learnt bias for blondes. They also show that prompting a content-moderation algorithm with a nonsense phrase like “paul long course parish body” could trigger the classifier to correctly label the phrase “Islam means peace” as non-toxic.

Overall, FairReprogram showed a 10.5% and 36.5% fairness improvement over leading methods for addressing bias in vision and language models, the researchers found. “It’s not the algorithm that’s to blame, it’s the data,” said study co-author Yang Zhang, an IBM researcher at the MIT-IBM Watson AI Lab. “Our method primes the model to ignore group attributes and make less biased decisions without the expense of having to retrain the model.”
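The core idea, learning an extra input while the model itself stays frozen, can be illustrated with a deliberately tiny stand-in. The sketch below is not the FairReprogram implementation: the four-number "model", the `prompt` slot, the penalty weight, and the finite-difference gradient (used here in place of backpropagation) are all assumptions chosen to keep the example short and dependency-free.

```python
import math

# Hypothetical frozen model: these constants are never updated below.
FROZEN = {"task": 4.0, "bias": -2.0, "group_in": 2.0, "group_out": -2.0}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(task_feature, group, prompt, model=FROZEN):
    """Score one input; `prompt` is an extra learnable input fed alongside the real features."""
    hidden = math.tanh(model["group_in"] * group + prompt)  # unit that carries the group signal
    logit = model["task"] * task_feature + model["bias"] + model["group_out"] * hidden
    return sigmoid(logit)

# Toy data: (task_feature, group, label); the label depends only on the task feature.
DATA = [(c, g, c) for c in (0, 1) for g in (0, 1)]

def fairness_gap(prompt):
    """Absolute difference in mean predicted score between the two groups."""
    def group_mean(group):
        scores = [predict(c, g, prompt) for c, g, _ in DATA if g == group]
        return sum(scores) / len(scores)
    return abs(group_mean(0) - group_mean(1))

def objective(prompt, penalty=10.0):
    """Cross-entropy on the task plus a penalty on the group gap."""
    loss = 0.0
    for c, g, label in DATA:
        p = predict(c, g, prompt)
        loss -= math.log(p) if label else math.log(1.0 - p)
    return loss / len(DATA) + penalty * fairness_gap(prompt) ** 2

def learn_prompt(steps=200, lr=0.5, eps=1e-4):
    """Gradient descent on the prompt only, using a central finite difference."""
    prompt = 0.0
    for _ in range(steps):
        grad = (objective(prompt + eps) - objective(prompt - eps)) / (2 * eps)
        prompt -= lr * grad
    return prompt

weights_before = dict(FROZEN)
gap_before = fairness_gap(0.0)
learned = learn_prompt()
gap_after = fairness_gap(learned)
print(gap_before, gap_after, FROZEN == weights_before)
```

In this toy the learned prompt saturates the unit that carries the group signal, so the group attribute has less effect on the output, at some cost in task loss. Real prompts are pixels or tokens and the real models are far larger, but the division of labor is the same: only the input changes.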

A complementary method for addressing bias in foundation models, also presented at NeurIPS this year, is called FairIJ. It identifies the training data responsible for an unfair AI decision and throws them out, fixing the model, again with no retraining. “You identify the few hundred data points responsible for making your classifier biased,” said study co-author and IBM researcher Prasanna Sattigeri. “If you just remove them, it turns out you drastically improve on fairness. The beauty of this technique, for foundation models especially, is you can avoid retraining the model.”

The tool uses a statistical method known as the infinitesimal jackknife to zip through each data point and estimate its influence on the model’s unfair decision. FairIJ prompts the model to disregard the most biased data and solve the task — all while the model’s parameters stay frozen. Tested on a salary-prediction task, FairIJ achieved a better trade-off between accuracy and fairness than several top bias mitigation methods, the researchers found.
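To give a feel for the infinitesimal jackknife, here is a toy example on a one-parameter regression. It is a sketch of the general statistical idea only, not of FairIJ: the model, the data, and the function names are made up, and the quantity being tracked is a regression slope, not a fairness metric. The idea is to give every data point a weight, differentiate the fitted parameter with respect to each weight, and use that derivative to estimate what would happen if the point were dropped, without refitting.

```python
def fit_slope(xs, ys, weights):
    """Weighted least-squares slope for y ~ slope * x (no intercept)."""
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    return sxy / sxx

def ij_leave_one_out(xs, ys):
    """Estimate each leave-one-out slope from a first-order expansion in the weights."""
    slope = fit_slope(xs, ys, [1.0] * len(xs))
    sxx = sum(x * x for x in xs)
    estimates = []
    for x, y in zip(xs, ys):
        # Derivative of the slope with respect to this point's weight, at weight = 1.
        d_slope_d_weight = x * (y - slope * x) / sxx
        # Dropping the point moves its weight from 1 to 0, a change of -1.
        estimates.append(slope - d_slope_d_weight)
    return estimates

def exact_leave_one_out(xs, ys):
    """Reference answer: actually refit with each point's weight set to zero."""
    results = []
    for i in range(len(xs)):
        weights = [0.0 if j == i else 1.0 for j in range(len(xs))]
        results.append(fit_slope(xs, ys, weights))
    return results

# Made-up data; the last point is an outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]

slope_all = fit_slope(xs, ys, [1.0] * len(xs))
approx = ij_leave_one_out(xs, ys)
exact = exact_leave_one_out(xs, ys)

for i, (a, e) in enumerate(zip(approx, exact)):
    print(i, round(slope_all - a, 4), round(slope_all - e, 4))
```

Both columns single out the same point as most influential. The approximation is first-order, so it is loosest for high-leverage points like the outlier, but it needs one fit instead of one per data point, which is what makes the approach attractive for large models.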

Beyond fairness, FairIJ can bring more transparency to foundation models by making it easier to understand how training data influence AI decisions, said study co-author and IBM researcher Soumya Ghosh. It can also speed up commonly used algorithms for validating other algorithms. In an earlier NeurIPS paper, the team showed the method could be applied to models that use time-series data to predict things like the weather or stock prices.

Teaching AI to treat similar individuals similarly

Group bias has been the focus of most fairness research for the practical reason that companies want to avoid running afoul of anti-discrimination laws. But individual bias is gaining more attention.

At NeurIPS last year, Yurochkin proposed a post-processing tool for ensuring that individuals who look similar on paper are treated similarly. The tool maps how closely individuals relate on attributes like education or years of experience. It then modifies the algorithm’s decision if similar-looking individuals are treated unequally.
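One simple way to picture this kind of post-processing is to smooth a model's scores over a similarity graph. The sketch below is an assumption-laden illustration, not the published tool: the similarity matrix, the `strength` parameter, and the iterative solver are all choices made for this example. It leaves the model untouched and adjusts only its outputs, pulling the scores of similar individuals toward each other.

```python
def smooth_scores(scores, similarity, strength=1.0, sweeps=200):
    """Adjust scores so that similar individuals end up with similar values.

    Minimizes  sum_i (f_i - y_i)^2 + strength * sum_{i<j} W_ij * (f_i - f_j)^2,
    where y are the original scores and W is a symmetric similarity matrix,
    using repeated Jacobi-style sweeps.
    """
    n = len(scores)
    current = list(scores)
    for _ in range(sweeps):
        updated = []
        for i in range(n):
            pull = sum(similarity[i][j] * current[j] for j in range(n) if j != i)
            weight = sum(similarity[i][j] for j in range(n) if j != i)
            updated.append((scores[i] + strength * pull) / (1.0 + strength * weight))
        current = updated
    return current

# Made-up example: candidates 0 and 1 have matching education and experience
# (similarity 1); candidate 2 is not similar to either (similarity 0).
original = [0.9, 0.3, 0.5]
similarity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

adjusted = smooth_scores(original, similarity, strength=2.0)
print([round(s, 4) for s in adjusted])
```

With these numbers the gap between the two similar candidates shrinks from 0.6 to 0.12, while the unrelated candidate's score is left alone. A larger `strength` closes the gap further at the cost of moving scores away from what the model originally said.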

An earlier tool that Yurochkin developed, called SenSei, is more comprehensive but requires the model’s top layer to be retrained. The post-processing technique, by contrast, surgically removes bias at run-time. “It’s not as thorough, but it’s a quick way to improve fairness,” he said.

The method recently proved itself in the wild, after someone on Twitter alerted the world to an AI gone awry: a classifier that panned movies based on where they were filmed. To Yurochkin, it seemed like an easy fix. In five minutes, with a few lines of code, he had the sentiment classifier rating movies regardless of geographic origin.

That example is now woven into his AI fairness tutorial, and in the past year, he has shifted his focus from developing new bias mitigation algorithms to marketing what’s already there. His audience: data scientists and software developers who may not realize that these solutions exist.

“A lot of what you read in the media is how everything doesn’t work,” he said. “But we have tools to fix, or at least mitigate, bias. We just need to let people know they’re here and how to use them.”
