
An AI model trained on data that looks real but won’t leak personal information

IBM unveils a new method for bringing privacy-preserving synthetic data closer to its real-world analog to improve the predictive value of models trained on it.

A revolution in how businesses handle customer data could be around the corner, and it’s based entirely on made-up information.

Banks, health care providers, and other highly regulated fields are sitting on piles of spreadsheet data that could be mined for insights with AI — if only it were easier and safer to access. Sharing data, even internally, comes with a high risk of leaking or exposing sensitive information. And the risks have only increased with the passage of new data-privacy laws in many countries.

Synthetic data has become an essential alternative. This is data that’s been generated algorithmically to mimic the statistical distribution of real data, without revealing information that could be used to reconstruct the original sample. Synthetic data lets companies build predictive models, and quickly and safely test new ideas before going to the effort of validating them on real data.

The standard security guarantee for synthetic data is something called differential privacy. It’s a mathematical framework for proving that synthetic data can’t be traced to its real-world analog, ensuring that it can be analyzed without revealing personal or sensitive information.
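
For readers who want the formal statement, here is a minimal sketch of the standard (ε, δ)-differential privacy definition; the exact guarantee attached to any particular synthetic-data generator may differ in its details.

```latex
% A randomized mechanism M is (epsilon, delta)-differentially private if,
% for every pair of datasets D and D' that differ in a single record and
% for every set of possible outputs S,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .
\]
% Smaller epsilon means any one person's record has less influence on the
% mechanism's output, which is the sense in which synthetic data "can't be
% traced" back to individuals in the original sample.
```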

But there’s a trade-off. Exactly capturing the statistical properties of the original sample is virtually impossible. Synthetic data with privacy guarantees is always an approximation, which means that predictions made by models trained on it can also be skewed.

Much of the customer data that enterprises collect is in spreadsheet form: words and values organized into rows and columns. “The biggest problem we’re trying to solve is how to recreate highly structured, relational datasets with privacy guarantees,” said Akash Srivastava, synthetic data lead at IBM Research. “Most machine-learning models treat data points as independent, but tabular data is full of relationships.”

The more relationships that are embedded, the greater the chance that someone’s identity might be revealed — even after personal information has been disguised as synthetic data. Businesses typically get around this by adding more statistical noise to their synthetic data to guard against privacy breaches. But the noisier the data gets, the less predictive it becomes.

Skewed predictions can be especially problematic for groups underrepresented in the original data. A model trained on misleading synthetic data, for example, could recommend rejecting minority loan applicants who would be considered qualified if real data had been used.

IBM researchers have proposed a solution: a technique that lets businesses clean up the synthetic data they’ve already generated so that it performs better on the target task, such as predicting more accurately whether a loan will be repaid.

The solution, to be presented at NeurIPS 2023, brings together an idea from the 1970s called information projection and a standard optimization method known as a compositional proximal gradient algorithm.
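
Roughly speaking, information projection looks for the distribution closest, in KL divergence, to the synthetic data that still satisfies the utility constraints measured on the real data. Below is a hedged mathematical sketch of that idea; the notation is chosen here for illustration and is not taken from the paper.

```latex
% Let Q be the empirical distribution of the synthetic data, f_1,...,f_k the
% utility measures, and c_1,...,c_k their (noise-protected) values computed
% on the real data. The information projection is
\[
  P^{\star} \;=\; \arg\min_{P}\; D_{\mathrm{KL}}\!\left(P \,\|\, Q\right)
  \quad\text{subject to}\quad
  \mathbb{E}_{P}\!\left[f_i(X)\right] = c_i, \qquad i = 1,\dots,k.
\]
% A classical result is that the minimizer is an exponential reweighting of Q,
\[
  P^{\star}(x) \;\propto\; Q(x)\,\exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x)\Big),
\]
% which is why the method can be carried out as resampling weights over the
% points of an existing synthetic dataset.
```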

“The mishandling of sensitive data can expose companies to huge liabilities,” said the study’s lead author Hao Wang, an IBM researcher at the MIT-IBM Watson AI Lab. “We want to make it easy for data curators to share their data without worrying about privacy breaches.”

A crystal ball made of synthetic data

Synthetic data with privacy guarantees is different from other methods commonly used to guard sensitive information. Data that’s been anonymized, for example, has been stripped of names and other identifying information, but individuals can often be identified by cross-checking the anonymized data against other demographic data. Anonymized data is excellent for building models, but not so great at protecting privacy.

Encrypted data, by contrast, is extremely secure, as long as the secret key to decode the information is kept safe. But encrypted data has no predictive value since it’s been scrambled without preserving its statistical properties.

Synthetic data with privacy guarantees offers a compromise, allowing companies to experiment on data that behaves like the original sample but doesn’t put sensitive information at risk.

It’s made the same way as large language model (LLM)-generated essays and poems: a generative model converts raw data — say, all of Wikipedia — into a simplified representation, and draws on this representation to create something similar, but not identical, to the original data. When privacy guarantees are added, the model is explicitly prevented from outputting data too similar to the real data it trained on.

Whether real data or a synthetic analog is used, the first step in building a predictive model is to create a correlation matrix. This lets a data scientist find features in the data that are most predictive of the target task.

Let’s say the task is to predict whether an applicant will repay their loan. Features that appear highly correlated with the target variable, such as monthly salary or credit score, are selected and used to train the model.
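
To make that step concrete, here is a minimal pandas sketch of building a correlation matrix and picking features; the dataset, column names (monthly_income, credit_score, repaid, and so on), and the 0.3 threshold are all invented for illustration.

```python
import pandas as pd

# Hypothetical loan dataset; values and column names are invented for illustration.
df = pd.DataFrame({
    "monthly_income": [4200, 3100, 5800, 2500, 7200, 3900],
    "credit_score":   [710,  640,  770,  590,  800,  680],
    "num_accounts":   [3,    5,    2,    6,    1,    4],
    "repaid":         [1,    0,    1,    0,    1,    1],   # target variable
})

# Correlation matrix over all columns (Pearson by default).
corr = df.corr(numeric_only=True)

# Rank features by the strength of their correlation with the target.
target_corr = corr["repaid"].drop("repaid").abs().sort_values(ascending=False)
print(target_corr)

# Keep the features most correlated with repayment to train a model on.
selected = target_corr[target_corr > 0.3].index.tolist()
print("Selected features:", selected)
```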

The problem with synthetic data is that features that look highly predictive in the correlation matrix may not actually be. If non-predictive features are mistakenly picked, the model may misclassify real-world applicants, leading to missed revenue or financial losses if the borrower defaults on the loan.

In the business world, synthetic tabular data has become so popular that there are now several ways of generating it. IBM researchers wondered if the data could be enhanced, regardless of how it was made, by aligning it with real data in the context of the target task.

Under the method IBM devised, the data scientist defines a set of goals, or “utility measures,” related to the target variable. Each utility measure is essentially a function; in the loan repayment example, it could be the correlation coefficient between a past applicant’s monthly income and whether they repaid the loan.

The data scientist computes the average value of each function on the real data, then adds noise to those values to protect privacy. From there, an optimization algorithm computes the optimal resampling weights for each data point in the synthetic dataset. These weights are used to toss out synthetic data points that are misaligned with the noisy utility measures computed on the real data.
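
To make the whole pipeline concrete, here is a self-contained Python sketch of the resampling idea under simplifying assumptions. It is not the paper’s algorithm: it uses a single utility measure, an arbitrary Laplace noise scale, and a plain exponential tilt solved by bisection as a stand-in for the compositional proximal gradient step, and every dataset, value, and function name (utility, tilt_weights, and so on) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical data; every value here is invented for illustration. ---
# "Real" data: monthly income (in $1000s) and whether the loan was repaid,
# with repayment genuinely linked to income.
real_income = rng.normal(5.0, 1.5, size=2000)
real_repaid = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(real_income - 5.0)))).astype(float)

# Synthetic data from some generator, where the income/repayment link is weaker.
syn_income = rng.normal(5.0, 1.5, size=2000)
syn_repaid = (rng.random(2000) < 0.55).astype(float)

# Step 1: a utility measure tied to the target task. Here, a single product
# moment E[income * repaid]; in practice there would be a set of such measures.
def utility(income, repaid):
    return income * repaid

real_stat = utility(real_income, real_repaid).mean()

# Step 2: perturb the real statistic with Laplace noise before it leaves the
# real data. (A real deployment would set the scale from the measure's
# sensitivity and the privacy budget; 0.05 is an arbitrary choice.)
noisy_target = real_stat + rng.laplace(scale=0.05)

# Step 3: find resampling weights w_i proportional to exp(lam * g_i) over the
# synthetic points so that the weighted utility matches the noisy target.
g = utility(syn_income, syn_repaid)

def tilt_weights(lam):
    logits = lam * g
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w / w.sum()

lo, hi = -50.0, 50.0                    # weighted mean of g increases with lam
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if np.dot(tilt_weights(mid), g) < noisy_target:
        lo = mid
    else:
        hi = mid
weights = tilt_weights(0.5 * (lo + hi))

# Step 4: resample the synthetic dataset according to the weights.
idx = rng.choice(len(g), size=len(g), replace=True, p=weights)
aligned_income, aligned_repaid = syn_income[idx], syn_repaid[idx]

print("utility on real data:      ", round(real_stat, 3))
print("synthetic, before aligning:", round(g.mean(), 3))
print("synthetic, after aligning: ", round(utility(aligned_income, aligned_repaid).mean(), 3))
```

With several utility measures the same idea carries over, but finding the weights becomes the kind of constrained optimization problem the compositional proximal gradient algorithm is designed to handle.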

The researchers ran experiments to see how their synthetic data-alignment method performed against real data on several classification tasks. They found that classifiers trained on the resampled data consistently outperformed generic synthetic data — all while maintaining privacy guarantees.

The procedure was also efficient, taking a single GPU about 4 minutes to run. The researchers also showed that it works on a dataset with more than 100 features, suggesting it could be applied to other large and complex datasets.

What’s next

The work is part of a larger effort at IBM to build a privacy-preserving platform for enterprises to generate and evaluate synthetic data. The platform will include tools to ensure that predictive models built on synthetic data perform as closely as possible to models trained on real data.

Currently, the researchers are applying their method to probabilistic graphical models, which are used to generate synthetic tabular data. But they plan to build privacy controls into large language models (LLMs) next. This way an enterprise could train an LLM on synthetic text data that has privacy guarantees but is still statistically close enough to the real thing to be useful.

“Enterprises want to extract insights from customer data without putting it at risk,” said Srivastava. “We’re working on solutions to do just that.”