Publication
ICDE 2023
Conference paper
Synthetic Data Generation for Enterprise DBMS
Abstract
A critical need for enterprise DBMS vendors is to generate synthetic databases for testing their engines and applications in a range of environments. These synthetic databases are targeted toward capturing the desired schematic properties and the statistical profiles of the data hosted on these schemas.
Several data generation frameworks have been proposed for OLAP over the past three decades. The early efforts focused on ab initio generation based on standard mathematical distributions. Subsequently, there was a shift to database-dependent regeneration, which aims to create a database with statistical properties similar to those of a specific client database. This client-specific perspective has been taken further in recent times through workload-dependent database regeneration, where the generated databases ensure query executions similar to those observed at the client site.
In this tutorial, we present a holistic coverage of synthetic data generation, highlighting the strengths and limitations of the above-mentioned framework classes. At the end, we enumerate a suite of open technical problems and future research directions.