Synthetic data is information generated on a computer to augment or replace real data, with the aim of improving AI models, protecting sensitive data, and mitigating bias.
Aim a firehose of data at a human, and you get information overload. But aim the same firehose at a computer, and you get machine-learning models that can learn to complete sentences as you type or spot tumors in medical scans that are often too subtle for the human eye to see.
Data is the raw material fueling much of today’s progress in artificial intelligence, producing fresh insights, new discoveries, and decisions backed by more evidence. Data is now so essential to the modern economy that demand for real, high-quality data has grown exponentially. At the same time, stricter data privacy rules and ever larger AI models have made gathering and labeling real data increasingly difficult or impractical.
Synthetic data is computer-generated information for testing and training AI models that has become indispensable in our data-driven era. It’s cheap to produce, comes automatically labeled, and sidesteps many of the logistical, ethical, and privacy issues that come with training deep learning models on real-world examples. The research firm Gartner estimates that, by 2030, synthetic data will overtake actual data in training AI models.
The beauty of synthesizing data on a computer is that it can be procured on-demand, customized to your exact specifications, and produced in nearly limitless quantities. Computer simulations are one popular way of creating synthetic datasets. With the help of a graphics engine, you can churn out an endless supply of realistic images and video created in a virtual world.
A second way of creating artificial data is with AI itself, using generative models to create realistic text, images, tables, and other data types. Model architectures that fall under the generative AI umbrella include transformer-based foundation models, diffusion models, and GANs that learn representations of the underlying data to generate versions in a similar style. DALL-E is one of the best known models for generating images, and GPT for text.
One of synthetic data’s key advantages is that it comes pre-labeled. Gathering real data and annotating it by hand is time-consuming, expensive, and often humanly impossible. The benefit of having a machine churn out digital facsimiles is that it already understands the data, eliminating the need for humans to painstakingly describe each image, sentence, or audio file.
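The pre-labeling advantage is easy to see in miniature. In this sketch (a toy illustration, not any specific IBM pipeline), a generator chooses a class first and then produces a data point for it, so every sample arrives with a ground-truth label attached:

```python
import random

def make_labeled_point(label):
    """Generate a 2-D point whose label is known by construction:
    class 0 clusters near (0, 0), class 1 near (5, 5)."""
    center = (0.0, 0.0) if label == 0 else (5.0, 5.0)
    x = center[0] + random.gauss(0, 1)
    y = center[1] + random.gauss(0, 1)
    return (x, y), label

# Because the generator picked each class itself, no human ever has to
# annotate a sample -- the label is a byproduct of generation.
dataset = [make_labeled_point(random.choice([0, 1])) for _ in range(1000)]
```

The same principle scales up to simulators and generative models: whatever the engine renders, it already knows what it rendered.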
Another advantage of synthetic data is that it allows companies to sidestep some of the regulatory issues involved in handling personal data. Healthcare records, financial data, and content on the web are all protected by privacy and copyright laws that make them difficult for companies to analyze at scale.
Financial services firms often rely on sensitive customer data for internal work like testing software, detecting fraud, and predicting stock market trends. To keep this information safe, companies follow strict internal procedures for handling the data; as a result, it can take months for employees to gain access to even anonymized records. Anonymization can also introduce errors that severely compromise the quality of the final product or prediction.
The challenge, then, is to create synthetic financial datasets that can’t be traced to individuals but preserve the statistical properties of the original data. “We want to clone the data almost exactly so that it's as useful as the real data but contains none of the sensitive private information,” said IBM's Kate Soule, a senior manager of Exploratory AI Research who co-leads, with Akash Srivastava, Project Synderella, a privacy-preserving synthetic data product.
Project Synderella aims to generate synthetic tabular data that banks and other enterprises can use to develop products faster and to help their customers unlock new insights.
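Project Synderella's actual method isn't described here, but the goal, synthetic records that match the statistics of the real table while corresponding to no actual customer, can be illustrated with a deliberately minimal sketch that fits a normal distribution to one column and samples from it:

```python
import random
import statistics

def synthesize_column(real_values, n):
    """Sample new values from a normal distribution fit to the real
    column. The synthetic values preserve the column's mean and spread
    but belong to no actual customer record."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [random.gauss(mu, sigma) for _ in range(n)]

# Hypothetical account balances -- illustrative numbers only.
real_balances = [1200.0, 5300.0, 800.0, 4100.0, 2500.0, 3900.0]
fake_balances = synthesize_column(real_balances, 1000)
```

Production systems go much further: they model the joint distribution across columns (with copulas, GANs, or diffusion models) and typically add formal guarantees such as differential privacy, which this per-column sketch deliberately omits.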
Training a billion-parameter foundation model takes time and money. Replacing even a fraction of real-world training data with synthetic data can make it faster and cheaper to train and deploy AI models of all sizes.
Synthetic images can be created in multiple ways. IBM researchers have used the ThreeDWorld simulator and related Task2Sim platform to simulate images of realistic scenes and objects for pretraining image classifiers. Not only do the fakes reduce the amount of real training data needed, they can be as effective as real images in pretraining a model to do things like detect cancer in a medical scan.
Synthetic images can be produced even faster using generative AI. MIT and IBM researchers recently combined thousands of small image-generating programs to crank out fake images with simple colors and textures. A classifier pretrained on these basic images performed more accurately than models trained on more detailed synthetic data, they found.
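A single "image-generating program" in this style can be very small. The sketch below is a hypothetical example in the same spirit, not the researchers' actual code: a few lines that render random horizontal stripes as rows of RGB tuples.

```python
import random

def render_stripes(width=32, height=32):
    """One tiny 'image-generating program': horizontal stripes of
    random colors, returned as rows of (r, g, b) tuples."""
    stripe_h = random.randint(2, 8)  # random stripe thickness
    colors = {}
    image = []
    for y in range(height):
        band = y // stripe_h
        if band not in colors:
            colors[band] = tuple(random.randint(0, 255) for _ in range(3))
        image.append([colors[band]] * width)
    return image

# Thousands of such programs, each with its own pattern family
# (stripes, checkers, noise), can churn out cheap structured images
# for pretraining at essentially zero collection cost.
img = render_stripes()
```

The appeal is that each program encodes some basic visual structure, color constancy, edges, repetition, which is apparently enough for a classifier to learn useful low-level features.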
Supplementing real data with more synthetic data can also reduce the chances that a model pretrained on raw data scraped from the internet will go off on a racist or sexist tangent. Synthetic data made to order can be vetted in advance, arriving with fewer biases.
“Doing as much as we can with synthetic data before we actually start using real-world data has the potential to clean up that Wild West mode we’re in,” said David Cox, co-director of the MIT-IBM Watson AI Lab and head of Exploratory AI Research.
The self-driving car industry embraced synthetic data early on. Collecting samples of all potential scenarios on the road, including rare, so-called edge cases, would be impractical to impossible. Synthetic data makes it possible to create customized data to fill the gaps.
Customer-care chatbots also see variation — in the accents, rhythm, and style of how people speak. It could take a chatbot years to learn the nuances of every customer request and how to respond effectively. As a result, synthetic data has become crucial to improving chatbot performance.
An algorithm developed by IBM Research, called LAMBADA, generates fake sentences aimed at filling a chatbot’s knowledge gaps. LAMBADA generates the sentences with GPT then vets them for accuracy. “You need to be very creative to imagine all of the edge cases,” said IBM’s Ateret Anaby-Tavor, an expert in natural language processing. “Instead, you can use a machine that with a push of a button gives you thousands of sentences. You just need to evaluate and filter them.”
Sometimes, though, there isn’t enough data to create a fake sentence. This is true for thousands of languages spoken worldwide by relatively few people. To train AI models on these so-called low-resource languages, IBM researchers have tried pretraining language models on image-grounded gibberish.
They recently showed that a model pretrained on complete nonsense performed nearly as well on a fill-in-the-blank fluency test as a model pretrained on Spanish. No matter what language we speak, said IBM researcher Chuang Gan, our visual world varies very little, creating a common foundation for natural language.
“Teaching the model an emergent language first can make it easier to learn non-Indo-European languages, while avoiding some of the cultural biases that come with pretraining on a Western language,” he said.
Synthetic data is also commonly used to test AI models for security flaws and biases. AI models that do well on benchmarks are often easy to trick with adversarial examples — images and text that have been subtly altered to trigger mistakes.
Using publicly available data, IBM researchers recently built a tool to fabricate quote tweets on Twitter to test the robustness of stock prediction models that trawl social media for tips. After ingesting the fake tweet, an AI stock picker that would otherwise have predicted a falling stock price and advised investors to sell might reverse its call and nudge them to buy instead.
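A toy version shows why such attacks work. The model below is a deliberately naive stand-in (a keyword counter, nothing like a production stock predictor): a single adversarial tweet stuffed with bullish words outweighs the genuine bearish signal and flips the prediction.

```python
def predict_direction(tweets):
    """Toy stock model: count bullish vs bearish keywords across a feed."""
    bullish = {"surge", "beat", "record", "buy"}
    bearish = {"miss", "plunge", "sell", "recall"}
    score = 0
    for t in tweets:
        words = set(t.lower().replace(",", " ").split())
        score += len(words & bullish) - len(words & bearish)
    return "up" if score > 0 else "down"

real_feed = ["Quarterly earnings miss, shares plunge"]
# Adversarial quote tweet crafted to outweigh the genuine signal:
attacked_feed = real_feed + ["Analysts say buy the record surge, beat estimates"]

# The real feed alone reads bearish; one injected tweet flips the call.
```

Real attacks are subtler, perturbing text in ways humans barely notice, but the failure mode is the same: the model's decision hinges on surface features an adversary can manufacture.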
Large models almost always contain hidden biases, too, picked up from the articles and images they have ingested. IBM researchers recently created a tool that finds these flaws and creates fake text to undo the model’s discriminatory assumptions. It works by generating a counterfactual conditioned on the class you want to test — a topic, tense, or sentiment — to flip the model's decision.
Take the statement: “my boss is a man.” The tool generates a hypothetical statement with the gender reversed: “my boss is a woman.” Such a minor change shouldn’t cause a classifier to change its “positive” sentiment rating to “negative,” but in this case it does. To mitigate the bias, the model could be retrained on a dataset augmented with counterfactuals, so that it learns that the statements are equivalent and should be classified similarly.
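The boss example can be sketched as a simple counterfactual augmentation step. This word-swap version is a crude stand-in for IBM's conditioned generative approach, but it captures the retraining idea: pair every example with its gender-flipped counterfactual under the same label, so the model learns to treat them alike.

```python
GENDER_SWAPS = {"man": "woman", "woman": "man", "he": "she", "she": "he",
                "his": "her", "her": "his"}

def counterfactual(sentence):
    """Flip gendered words to create a minimally-changed test case."""
    return " ".join(GENDER_SWAPS.get(w, w) for w in sentence.split())

def augment(dataset):
    """Pair every labeled example with its counterfactual under the SAME
    label, teaching the model the two should be classified alike."""
    return dataset + [(counterfactual(text), label) for text, label in dataset]

original = [("my boss is a man", "positive")]
augmented = augment(original)
# augmented now also contains ("my boss is a woman", "positive")
```

Generative counterfactuals go well beyond dictionary swaps, handling tense, topic, and sentiment, but the augmentation logic downstream is the same.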
“Real world data is rarely problem-free,” said IBM’s Inkit Padhi. “Synthetic data allow us to find and fix problems in AI models to make them more fair, robust, and transferrable to other tasks.”