Publication
Big Data 2020
Conference paper

DQLearn : A Toolkit for Structured Data Quality Learning

View publication

Abstract

Data Quality (DQ) has been one of the key focuses as Data Analytics and Artificial Intelligence (AI) fields continue to grow. Yet, data quality analysis has mostly been a disjointed, ad-hoc, and cumbersome process in the overall data analysis workflow. There have been ongoing attempts to formalize this process, but the solutions that have come out are not universally applicable. Most of the proposed solutions try to address the problem of data quality from a limited perspective and suc-cessfully address only a subset of all challenges. These solutions fail to translate to other domains due to a lack of structure. In this paper, we present DQLearn, a toolkit for structured data quality learning. We start by presenting the core principle on which we build our library and introduce the four components that provide a solid base to address the needs of the data quality problem. Then, we showcase our automation structure - "Workflows", and the two optimization techniques equipped with it, that help the users to structure their learning problem very easily. Next, we discuss four important scenarios of the DQ Workflows in the overall life-cycle. Finally, we demonstrate the utility of the proposed toolkit with public datasets and show benchmark results from optimization experiments.

Date

10 Dec 2020

Publication

Big Data 2020

Authors

Share