Five ways IBM is using synthetic data to improve AI models

Synthetic data is information generated on a computer to augment or replace real data to test and train AI models.

“We're entering an era in which our enemies can make anyone say anything at any point in time.” In this viral video from 2018, actor-writer Jordan Peele projected his voice into former President Obama’s moving lips. Peele’s PSA on ‘deepfakes,’ audio and video manipulated to deceive, was the first time many people heard of synthetic data.

It won’t be the last. Today, synthetic data are everywhere, driving some of AI’s most innovative applications.

They’re commonly used to inject more variety into datasets to make machine-learning models more accurate and reliable. They also provide a low-cost stand-in for health care records, financial data, copyrighted material, and other data that are either highly regulated or come with privacy and ethical concerns.

Synthetic data are also invaluable for probing a trained model for security flaws and biases. Deployed as adversarial examples, fake data can show where an AI model is likely to make mistakes or unfair decisions. “Synthetic data can be an indispensable testing tool for AI,” said Inkit Padhi, an expert on synthetic data at IBM Research. “They can help to make AI models more fair, accurate, and trustworthy.”

Here are five inventive ways that IBM is using synthetic data to improve AI models.

Gibberish 101 to learn a living language

Studying nonsense before trying to learn Urdu, Pakistan’s official language, might sound ridiculous. But it’s how IBM researchers are tackling the moonshot of developing AI applications for less dominant languages.

Thousands of spoken languages have relatively few texts in machine-readable form, stalling the development of AI applications for those languages. In a paper spotlighted at ICLR this year, IBM researchers showed that pretraining a language model on a made-up language grounded in images could make it easier to master a low-resource language like Urdu.

“When humans learn to talk, they associate words with visual concepts,” said Yang Zhang, an IBM researcher with the MIT-IBM Watson AI Lab. “We try to mimic that idea here.”

Researchers used a generative AI model to create some 2 million symbolic “tokens” in a game pairing symbols with natural images. One algorithm receives a visual prompt — like an image of a bedroom — and outputs a sequence of numbers. A second algorithm compares the numbers to a set of images and picks the image that seems like the best match. Eventually, an emergent language arises from these image-grounded symbols.
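The two-player game described above can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' method: in the paper both players are trained models, whereas here the hypothetical `sender` simply quantizes an image's feature vector into integer symbols and the hypothetical `receiver` picks the candidate closest to the decoded message. The names `NUM_BINS`, `sender`, `receiver`, `play_round`, and `play_game` are all assumptions made for the example.

```python
import random

NUM_BINS = 8  # assumed number of distinct symbols per message position


def sender(features):
    """Turn an image feature vector (values in [0, 1)) into a sequence of integers."""
    return [min(int(x * NUM_BINS), NUM_BINS - 1) for x in features]


def receiver(message, candidates):
    """Return the index of the candidate image that best matches the message."""
    # Decode each symbol to the center of its bin, then pick the nearest candidate.
    decoded = [(symbol + 0.5) / NUM_BINS for symbol in message]

    def distance(features):
        return sum((a - b) ** 2 for a, b in zip(decoded, features))

    return min(range(len(candidates)), key=lambda i: distance(candidates[i]))


def play_round(rng, num_candidates=4, dim=6):
    """Play one round with random stand-in 'images'; return (message, success)."""
    candidates = [[rng.random() for _ in range(dim)] for _ in range(num_candidates)]
    target = rng.randrange(num_candidates)
    message = sender(candidates[target])
    guess = receiver(message, candidates)
    return message, guess == target


def play_game(rounds=1000, seed=0):
    """Return the fraction of rounds in which the receiver found the target."""
    rng = random.Random(seed)
    wins = sum(play_round(rng)[1] for _ in range(rounds))
    return wins / rounds
```

Calling `play_game()` reports how often the receiver recovers the right image from the symbols alone. In a learned version of the game, that success signal is what would push the two players toward a shared code.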

Trained on this prototype language, the model was fine-tuned on labeled text in Urdu, Basque, Persian and seven other languages. In the end, the model performed nearly as well on a fill-in-the-blank fluency test as a model pretrained on Spanish, the researchers found. They hypothesize that no matter what language we speak, our visual world is largely the same, creating a common foundation for natural language.

“It’s easier to learn German if you know English, but that’s not the case with non-Indo-European languages like Niger-Congo or Trans-New Guinea,” said Chuang Gan, an IBM researcher with the MIT-IBM Watson AI Lab. “Teaching the model an emergent language first can make it easier to learn non-Indo-European languages, while avoiding some of the cultural biases that come with pretraining on a Western language.”

A new way of designing machines that move

Show an AI model enough Impressionist art and it can learn to paint in that style. But try to design a windshield wiper that way, and it’s almost certain to fail. 

Most moving machines require a linkage mechanism that transfers motion or force from one part to another. Think of the wipers that clear rain and snow from your car windshield: A motor rotates an arm connected to links that move each wiper.

"When you create an image using AI, you can get two pixels wrong, and it doesn't matter," said Faez Ahmed, a mechanical engineering professor at MIT. "But if you're designing a mechanical system, a small change may lead the whole thing to fail. This makes the problem exceedingly challenging."

Most linkage systems today are designed manually because of the high level of precision needed. Using a computer-aided design (CAD) program, engineers move around the joints and bars of a mechanism until they hit on one that can produce the desired movement.

Led by Ahmed at MIT and Akash Srivastava at the MIT-IBM Watson AI Lab, researchers want to turn this process on its head. Give the AI a goal, and let it propose a linkage system that can produce the desired movement.

In a recent breakthrough, researchers created an AI-generated dataset of 100 million mechanisms, more than 1,000 times larger than the next biggest archive of 30,000 mechanisms. The dataset also features mechanisms with up to 20 joints, far more complex than a human could ever dream up.

As linkage systems grow in complexity, they become less and less likely to work, a principle that also applies to AI-generated mechanisms. To create a dataset with 100 million functioning mechanisms, the researchers ran billions of simulations and threw out most of their designs. They were able to run that many simulations, they said, only after figuring out how to speed up the process by 800 times.
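The generate-and-discard loop can be illustrated with the simplest linkage, a four-bar mechanism. The sketch below is an assumption-laden stand-in, not the researchers' pipeline: instead of running a kinematic simulation, it uses the classical Grashof condition as a closed-form validity check, and the function names and the length range of 1 to 10 are invented for the example.

```python
import random


def is_valid_crank_rocker(ground, crank, coupler, rocker):
    """Check whether a four-bar linkage lets the crank rotate fully.

    Grashof condition: shortest + longest < sum of the other two links.
    This also guarantees the four links can close into a loop. For a
    crank-rocker, the crank must additionally be the shortest link.
    """
    shortest, mid_a, mid_b, longest = sorted([ground, crank, coupler, rocker])
    grashof = shortest + longest < mid_a + mid_b
    return grashof and crank == shortest


def generate_valid(num_wanted, seed=0):
    """Sample random designs, keep only valid ones, and count every attempt."""
    rng = random.Random(seed)
    kept, tried = [], 0
    while len(kept) < num_wanted:
        tried += 1
        candidate = tuple(rng.uniform(1.0, 10.0) for _ in range(4))
        if is_valid_crank_rocker(*candidate):
            kept.append(candidate)
    return kept, tried
```

Comparing `tried` with `len(kept)` shows how many candidates are thrown away even for this four-link case, which hints at why the cost of each validity check matters so much at larger scale.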

They next plan to expand their dataset from 2D planar mechanisms to sliders, cams and gears. “Designing machines using probabilistic generative modeling rather than traditional optimization techniques has the potential to bring more creativity and efficiency into the design process,” said IBM's Srivastava. “I’m excited to see what this AI can help us achieve.”

You can design your own mechanism with this demo from Ahmed’s lab.

‘Hallucinated’ synthetic images to improve machine translation

As children, we learn language with all our senses. The more associations, the easier it is to remember new words. Researchers draw on this principle with Valhalla, an AI model that uses fake images to improve machine translation.

Feed the model a sentence in English, and Valhalla draws a visual representation, using a Dall-E-like transformer. It then extrapolates from the ‘hallucinated’ picture to translate from English to, say, French. “Imagining objects or scenes in our mind’s eye improves their memorability,” said Rameswar Panda, an IBM expert in computer vision at the MIT-IBM Watson AI Lab. “We thought machines might be similar.”

Researchers trained their translation model on pairs of sentences in the source and target language, matched with their pictorial representation. Give the model a sentence in one language, and it learns to generate a picture, then use it to predict how the sentence should read in the target language. The team showed that their method produced more accurate translations than a model trained on text alone. It could also handle longer sentences, under-resourced languages, and sentences with missing words.

Probing stock prediction models for security flaws

AI models that ace performance benchmarks in the lab are often highly sensitive to adversarial examples — images and text that have been subtly altered to trigger mistakes. Using publicly available data, IBM researchers recently built a tool to fabricate quote tweets on Twitter to test the robustness of stock prediction models that trawl social media for tips.

The tool selects the tweet of a CEO or other influencer and finds a word in their tweet deemed most likely to flip the stock prediction model. The tool then swaps that word with one that’s semantically similar when it quote-tweets the CEO’s original post.

The substitute word is unlikely to raise any red flags because of its similar meaning, but it’s enough to make the stock prediction model reverse its prediction. After ingesting the fake tweet, a stock picker that might have predicted a falling stock price and suggested that investors sell might instead nudge them to buy.
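A minimal sketch of this kind of word-substitution attack follows. Everything in it is hypothetical: the `WEIGHTS` table stands in for a trained stock prediction model, the `SYNONYMS` table stands in for a real source of semantically similar words, and the brute-force search replaces whatever word-ranking method the IBM tool actually uses.

```python
# Hypothetical word weights standing in for a trained stock prediction model.
# The model happens to treat "solid" slightly negatively, which is the kind
# of quirk an attacker can exploit.
WEIGHTS = {"strong": 1.0, "solid": -0.2, "weak": -1.0}

# Hypothetical table of near-synonyms a human reader would not question.
SYNONYMS = {"strong": ["solid", "robust"]}


def score(text):
    """Sum the weights of known words; unknown words count as zero."""
    return sum(WEIGHTS.get(word, 0.0) for word in text.lower().split())


def predict(text):
    """Return 'buy' for a positive score and 'sell' otherwise."""
    return "buy" if score(text) > 0 else "sell"


def attack(tweet):
    """Try each single-word synonym swap; return the first one that flips
    the prediction, or None if no swap does."""
    original = predict(tweet)
    words = tweet.split()
    for i, word in enumerate(words):
        for substitute in SYNONYMS.get(word.lower(), []):
            candidate = " ".join(words[:i] + [substitute] + words[i + 1:])
            if predict(candidate) != original:
                return candidate
    return None
```

For the tweet "We expect strong quarterly results", the toy model says buy, and `attack` returns the same sentence with "solid" in place of "strong", which the toy model reads as sell.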

“If you want to manipulate stock prices, you don’t need access to an investor’s model or data,” said IBM researcher Dakuo Wang. “You just create a few hundred fake Twitter accounts, pretend to be an investor, and change a word or two when quote tweeting the CEO.”

De-biasing sexist sentiment classifiers

Language models are sometimes used to scan things like news articles and earnings reports to quickly label their emotional tone as positive or negative. This type of shorthand sentiment analysis is useful for a variety of applications, from investing to running your fantasy football team. But sentiment classifiers can produce biased or misleading results when trained on text with implicit racist, sexist, or ageist assumptions.

In a 2021 paper at AAAI, IBM researchers introduced a tool for creating synthetic text to reduce bias in language classification models. It works by generating a counterfactual conditioned on the class you want to test (a topic, tense, or sentiment) to flip the model’s decision.

Take the statement: “my boss is a man.” The tool generates a hypothetical statement with the gender reversed: “my boss is a woman.” Such a minor change shouldn’t cause a classifier to change its “positive” sentiment rating to “negative,” but in this case it does. To mitigate the bias, the model could be retrained on a dataset augmented with counterfactuals, said IBM’s Padhi, to teach it that the statements are equivalent and should be classified similarly.
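The probe-then-augment idea can be sketched as follows. This is a simplified assumption, not the IBM tool: the paper's counterfactuals come from a generative model, while this example uses a hypothetical word-swap table (`SWAPS`), and `biased_classifier` is a deliberately flawed toy that stands in for a real sentiment model.

```python
# Hypothetical swap table; a real tool would generate counterfactuals
# with a language model rather than a fixed word list.
SWAPS = {"man": "woman", "woman": "man", "he": "she", "she": "he"}


def counterfactual(text):
    """Return the text with each gendered word replaced by its counterpart."""
    return " ".join(SWAPS.get(word, word) for word in text.split())


def biased_classifier(text):
    """Toy stand-in for a flawed sentiment model that penalizes 'woman'."""
    return "negative" if "woman" in text.split() else "positive"


def find_bias(classifier, texts):
    """Return (original, counterfactual) pairs the classifier labels differently."""
    return [
        (text, counterfactual(text))
        for text in texts
        if classifier(text) != classifier(counterfactual(text))
    ]


def augment(dataset):
    """Add each counterfactual to the dataset with the same label as its original."""
    augmented = list(dataset)
    for text, label in dataset:
        flipped = counterfactual(text)
        if flipped != text:
            augmented.append((flipped, label))
    return augmented
```

Running `find_bias(biased_classifier, ["my boss is a man"])` surfaces the inconsistent pair, and `augment` produces the kind of balanced training set a retraining step could use.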

“Real-world data are rarely free of complications,” he said. “Synthetic data offer a way to probe AI models for problems and correct them in order to make them more fair, robust, and easier to transfer to other tasks.”
