KDD 2022

Advances in exploratory data analysis, visualisation and quality for data centric AI systems


It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and the role of exploratory data analysis (EDA) and data visualisation techniques to find data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and bring out areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss the most popular open-source packages that one can get started with along with their strengths and weaknesses. Finally, we will discuss on the challenges posed by industry workloads and the gaps to be addressed to make data-centric AI real in industry settings.