Synderella

A synthetic data generation service that enables secure and responsible analysis of privacy-sensitive data

Archived

Overview

Companies are struggling to unlock the value of their sensitive data due to several challenges, including:

Long wait times to access sensitive datasets: Internal data scientists often have to wait months to gain access to the data they need to do their jobs. This is due to extensive internal bureaucracies and regulations prohibiting data from being shared externally or across geographic regions.
Regulations: Regulations such as GDPR and CCPA, along with FTC rules, prohibit using certain types of data, such as customer personal information, in specific ways. This makes it difficult for companies to use their data to its full potential.
Data security risks: Data breaches are a significant concern for companies that store sensitive data. This risk limits the use of real data in vendor evaluation or public cloud environments.

Synderella addresses these challenges. It is a no-code platform for creating realistic synthetic versions of sensitive tabular datasets that preserve the trends, signals, and relationships found in the real data. The synthetic data can be used to build and test software or to train predictive AI models. Unlike the real datasets, however, it contains no customer information and is therefore not subject to the same regulatory or ethical considerations.

How it works

Synderella relies on generative AI models to first learn a representation of a sensitive dataset and then to generate large volumes of new, fake data that behave according to that learned representation.
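
The learn-then-generate pattern can be illustrated with a minimal Python sketch. It is not Synderella's actual model: a multivariate Gaussian stands in for the generative AI model, and the function names (`fit_gaussian`, `sample_synthetic`) and the toy age/income table are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_gaussian(real: np.ndarray):
    """Learn a simple representation of a numeric table: column means and covariance.

    Assumption: a Gaussian is a stand-in for the generative model here.
    """
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Generate new rows that follow the learned representation."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Toy "sensitive" table (hypothetical): income is correlated with age.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=500)
income = 1000 * age + rng.normal(0, 5000, size=500)
real = np.column_stack([age, income])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=2000)

# The synthetic table can be larger than the real one while
# keeping a similar relationship between the columns.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr: {real_corr:.2f}, synthetic corr: {synth_corr:.2f}")
```

Note that this sketch applies no privacy protection on its own; the fitted parameters are computed directly from the real rows.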

To keep sensitive information from leaking from the real data into the synthetic data, the platform uses the mathematical framework of “differential privacy,” which adds calibrated noise during generation so that the presence or absence of any single individual, including rare ones, in the underlying training dataset is obscured.
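
As a rough illustration of how noise can hide rare individuals, the sketch below applies the Laplace mechanism to a histogram of one categorical column and samples synthetic values from the noisy counts. This is a textbook technique used as a stand-in; the source does not say which mechanism Synderella uses. The function name `dp_histogram_synthesizer`, the category list, and the choice of `epsilon` are assumptions for the example.

```python
import numpy as np

def dp_histogram_synthesizer(values, categories, epsilon, n_samples, seed=0):
    """Sample synthetic values from a Laplace-noised histogram.

    Assumptions: `categories` is fixed in advance (not derived from the data),
    and neighboring datasets differ by adding or removing one record.
    """
    rng = np.random.default_rng(seed)
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)

    # One record changes exactly one count by 1, so the L1 sensitivity is 1
    # and Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(categories))

    # Clipping and normalizing are post-processing and do not weaken the guarantee.
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0:
        probs = np.full(len(categories), 1.0 / len(categories))
    else:
        probs = noisy / noisy.sum()

    return rng.choice(categories, size=n_samples, p=probs)

# Toy column (hypothetical): category "C" is held by a single rare individual.
categories = ["A", "B", "C"]
values = np.array(["A"] * 600 + ["B"] * 399 + ["C"] * 1)

synthetic = dp_histogram_synthesizer(values, categories, epsilon=1.0, n_samples=1000)
```

A smaller `epsilon` means more noise and stronger privacy, at the cost of synthetic data that tracks the real distribution less closely.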

Synderella can be run wherever sensitive data is stored, whether on a private cloud or on-premises.

Publications

Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Hao Wang, Shivchander Sudalairaj, et al.
NeurIPS 2023 (Poster)

Resources

Blog Post

What is synthetic data?

IBM Research Blog, Feb 8, 2023