Publication
VLDB 2022
Conference paper

DQDF: Data-Quality-Aware Dataframes

Abstract

Data quality assessment is an essential part of any data analysis process, including machine learning. It is time-consuming because it involves multiple independent data quality checks that are performed iteratively and at scale on evolving data produced by exploratory data analysis (EDA). Existing solutions that provide computational optimizations for data quality assessment often separate the data structure from its data quality, which forces users to explicitly maintain state-like information. To achieve high-level pipeline optimizations, they also demand a certain level of distributed systems knowledge from data analysts, who should instead be focusing on analyzing the data. We therefore propose data-quality-aware dataframes (DQDF), a data quality management system embedded in a data structure that data analysts already know, such as a Python dataframe. The framework automatically detects changes in a dataset's metadata and exploits the context of each quality check to provide efficient data quality assessment on ever-changing data. We demonstrate experimentally that our approach can reduce the overall data quality evaluation runtime by 40-80% in both local and distributed setups, with less than a 10% increase in memory usage.
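
The idea of keeping quality state inside the data structure can be illustrated with a minimal sketch. This is not the DQDF implementation or its API: the class `QualityAwareFrame`, its methods, and the two example checks are hypothetical names chosen for illustration, and a plain dictionary of lists stands in for a real dataframe. The sketch assumes that each check declares which columns it reads, so a check is re-run only when one of those columns has changed since the last evaluation.

```python
from typing import Callable, Dict, Sequence, Tuple


class QualityAwareFrame:
    """Toy column store that caches quality-check results per column version.

    Hypothetical illustration only; not the DQDF API.
    """

    def __init__(self, columns: Dict[str, list]):
        self._columns = {name: list(values) for name, values in columns.items()}
        # One version counter per column, bumped on every write.
        self._versions = {name: 0 for name in columns}
        # check name -> (columns it reads, function computing the metric)
        self._checks: Dict[str, Tuple[Tuple[str, ...], Callable[..., float]]] = {}
        # check name -> (column versions seen at evaluation time, result)
        self._cache: Dict[str, Tuple[Tuple[int, ...], float]] = {}
        self.runs = 0  # number of times any check function was executed

    def set_column(self, name: str, values: Sequence) -> None:
        """Replace a column and record that its content changed."""
        self._columns[name] = list(values)
        self._versions[name] = self._versions.get(name, -1) + 1

    def add_check(self, check_name: str, depends_on: Sequence[str],
                  fn: Callable[..., float]) -> None:
        """Register a check together with the columns it depends on."""
        self._checks[check_name] = (tuple(depends_on), fn)

    def evaluate(self) -> Dict[str, float]:
        """Return all check results, recomputing only the stale ones."""
        results = {}
        for check_name, (deps, fn) in self._checks.items():
            stamp = tuple(self._versions[d] for d in deps)
            cached = self._cache.get(check_name)
            if cached is None or cached[0] != stamp:
                self.runs += 1
                value = fn(*(self._columns[d] for d in deps))
                self._cache[check_name] = (stamp, value)
            results[check_name] = self._cache[check_name][1]
        return results


def completeness(values: list) -> float:
    """Fraction of non-missing values (1.0 for an empty column)."""
    return sum(v is not None for v in values) / len(values) if values else 1.0


def uniqueness(values: list) -> float:
    """Fraction of distinct values (1.0 for an empty column)."""
    return len(set(values)) / len(values) if values else 1.0


frame = QualityAwareFrame({"id": [1, 2, 3, 4], "age": [31, None, 45, 27]})
frame.add_check("age_completeness", ["age"], completeness)
frame.add_check("id_uniqueness", ["id"], uniqueness)

first = frame.evaluate()                    # both checks run
frame.set_column("age", [31, 38, 45, 27])   # only "age" changes
second = frame.evaluate()                   # only the "age" check re-runs
```

In this sketch the second call to `evaluate()` executes one check function instead of two, because the `id` column is unchanged and its cached result is reused. A real system would track changes through dataframe operations rather than an explicit `set_column` call.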