A new way to generate synthetic data for pretraining computer vision models

IBM's Task2Sim churns out synthetic images tailored for specific AI tasks to reduce the need for real data.

From chatbots to spellcheckers, modern AI was built on real data. Trained on words and video scraped from sites like Wikipedia and YouTube, deep learning models learned to make predictions and decisions based on patterns extracted from billions of real-world examples.

Despite the progress, real data come with serious shortcomings. Healthcare records, financial data, consumer data, and content on the web are protected by privacy, ethics, and copyright laws. Other types of data come with high curation costs, baked-in vulnerabilities, and biases that have led to now-familiar incidents of chatbots going off on racist and sexist rants, and resumé screeners bypassing qualified job candidates.

Synthetic data offer a work-around. These are computer-generated images that look real, but face fewer permission hurdles. They’re impervious to malicious attacks, and could mitigate the biases that continue to make headlines and rattle public confidence in AI.

In a pair of papers at this year’s Computer Vision and Pattern Recognition Conference (CVPR), IBM researchers show that an image classifier pretrained on synthetic data and fine-tuned on real data for downstream tasks did as well as one trained exclusively on ImageNet’s database of real-world photos. Their classifier even proved capable of transferring its skills to never-before-seen tasks.

“Being able to do as much as we can with synthetic data before we actually start using real-world data has the potential to clean up that Wild West mode we’re in,” said David Cox, IBM director of the MIT-IBM Watson AI Lab.

By 2024, 60% of the data used in training AI models will be synthetically generated, according to the market research firm Gartner. Here, too, AI will play a pivotal role, creating data optimized to train AI, in what Cox calls “a perpetual motion machine.”

Mirroring the real world without the mess

There are two main ways of making fake images. The first is through generative models, the AI technique that can convert selfies into Renaissance-style portraits. The other is with graphics engines, the technology that creates convincing worlds for video games, and now acts as a proving ground for warehouse robots and self-driving cars.

Both methods can churn out the kinds of rare and varied examples that make AI models smarter. But one advantage of training models on images made from scratch in a virtual world is that they come with fewer obstacles, including the tedious job of labeling what’s in each picture.

In collaboration with colleagues at Boston University, IBM researchers developed Task2Sim,¹ an AI model that learns to generate fake, task-specific data for pretraining image-classification models. “The beauty of synthetic images is that you can control their parameters — the background, lighting, and the way objects are posed,” said Rogerio Feris, an IBM researcher who co-authored both papers. “You can generate unlimited training data, and you get labels for free.”

To create pictures with realistic objects and scenes, the researchers used ThreeDWorld, an environment built on the Unity graphics engine that was designed in part by the MIT-IBM Watson AI Lab. Self-driving cars are trained on a similar type of platform, but here, the researchers sought to create images optimized for learning multiple tasks. Could you teach an AI to make images that could train a classifier to identify flowers, then use that knowledge to make images to learn what birds look like?

Learning to generate images tailored for task-specific learning

Graphics engines are image-generating machines — literally. But for practical applications like scanning chest X-rays for pneumonia or detecting plant rot with satellite images, you want to be selective in picking your training data, Feris said. With Task2Sim, the researchers sought to create an AI that learns to strategically synthesize training data. If AI models could be pretrained on realistic fake data, they wondered, could you skirt the trouble that models trained on reams of unvetted real data get into?

Researchers trained Task2Sim on a dozen image-classification tasks based on satellite images, medical scans, drawings, and more. Each task is translated into a vector of numbers using an existing tool, Task2Vec. The vectors are then fed to Task2Sim, a deep neural network that figures out which visual knobs to turn to fabricate the richest dataset for a particular task. In the end, Task2Sim outputs images by the thousands, varying their blurriness and brightness in one task, or their background and lighting color in another.
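To make the flow concrete, here is a minimal Python sketch of the idea: a task vector goes in, and a setting for each visual knob comes out. It is an illustration under stated assumptions, not IBM's implementation. The knob names, the random weights standing in for a trained network, and the four-number task vector are all hypothetical; the real system uses a Task2Vec embedding and a learned deep network.

```python
import random

# Hypothetical simulation knobs, named after the variations the article mentions.
KNOBS = ["blur", "brightness", "background", "lighting_color"]


def make_controller(embedding_dim, seed=0):
    """Return one weight vector per knob.

    Assumption: random weights stand in for parameters that a real
    controller would learn from downstream classifier accuracy.
    """
    rng = random.Random(seed)
    return {
        knob: [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
        for knob in KNOBS
    }


def choose_sim_params(task_embedding, controller):
    """Map a task vector to an on/off decision for each knob."""
    params = {}
    for knob, weights in controller.items():
        # A single linear score per knob; a positive score means
        # "vary this property when rendering images for the task".
        score = sum(w * x for w, x in zip(weights, task_embedding))
        params[knob] = score > 0.0
    return params


# Stand-in for a Task2Vec embedding (hypothetical values).
task_embedding = [0.2, -1.3, 0.7, 0.05]
controller = make_controller(embedding_dim=len(task_embedding))
sim_params = choose_sim_params(task_embedding, controller)
print(sim_params)
```

In this sketch, the resulting dictionary would be handed to a graphics engine to render a pretraining set, and different task vectors would generally switch on different knobs.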

Not only did a classifier pretrained on Task2Sim’s fake images perform as well as a model trained on real ImageNet photos, it also outperformed a rival trained on images generated with random simulation parameters. Task2Sim even transferred its know-how to entirely new tasks, creating images to teach a classifier how to identify cactuses and hand-drawn numbers. “The more tasks you use during training, the more generalizable the model will be,” Feris said.

A related tool, SimVQA,² also appearing at CVPR, generates synthetic text and images for training robot agents to reason about the visual world. In a typical visual-reasoning task, an agent might be asked to count the number of chairs at a table or identify the color of a bouquet of flowers. SimVQA can quickly augment visual-question-answering datasets by creating tables with extra seats or flowers of many colors, with questions to match.

Next, IBM researchers will see if they can go one step further and train their classifier to outperform those trained on real data. They also want to exploit synthetic data for higher-level vision tasks like detecting objects and animals in scenes, and segmenting images into their components. “These are tasks that require much more detail, to map pixels to objects,” study co-author and IBM researcher Rameswar Panda said. “Synthetic data will potentially have an important role to play in developing these next-generation systems.”

A safer way to learn the physical world

The work is part of a larger effort to ground AI models in the physical world. If AI models can be taught how objects and animals behave in a virtual world, they may better grasp the complexities and unpredictability of ours. An AI that understands how light reflects on surfaces, or varies by time of day, is one step closer to learning how the real world is structured. “You don’t need real data to necessarily learn that structure,” Cox said.

“It’s sufficient to do the heavy lifting with data that’s synthetic, that’s safe.”
