Automated data science for relational data

Hoang Thanh Lam; Beat Buesser; Hong Min; Tran Ngoc Minh; Martin Wistuba; Udayan Khurana; Gregory Bramble; Theodoros Salonidis; Dakuo Wang; Horst Samulowitz

doi:10.1109/ICDE51399.2021.00305

ICDE 2021

Conference paper

01 Apr 2021

Abstract

Feature engineering is a crucial but tedious task that can take up to 80% of the total time in a data science project. It is especially challenging when the data consists of tables from different sources, because data scientists must decide carefully how to join and aggregate the tables while engineering features. In this work, we demonstrate a novel system called OneBM (One Button Machine) that enables data scientists to work more efficiently by automating feature engineering for relational data. OneBM takes as input a relational dataset with multiple tables and its entity relation diagram (ERD), which can be declared with a novel, easy-to-use drag-and-drop graphical user interface. The system then automatically identifies and executes relevant joins and aggregations in the data and generates new features with a rich set of transformations for various types of data, including but not limited to time series, sequences, number sets, and itemsets. The generated features can then be used by automated model selection and hyper-parameter optimization algorithms to complete a fully end-to-end automated data science (AutoDS) workflow. A follow-up user evaluation illustrated how data scientists can perform multi-table feature engineering tasks in minutes using our system, whereas repeatedly coding SQL-like queries to transform and aggregate relational data requires weeks of manual labor for comparable performance. In the live demos, we plan to show two use cases with real-world datasets (video demos are available at the links in the paper's footnotes): sales prediction and call center user experience. Pre-registered participants can try these use cases and the given datasets via Watson Studio on the cloud.
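To make the join-and-aggregate step concrete, here is a minimal hand-written sketch in pandas of the kind of feature that such a system would derive automatically. It is not OneBM's implementation or API. The tables, the column names, and the helper `aggregate_child` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: a main table of customers and a child table of orders
# linked by customer_id, as a one-to-many edge in an ERD would describe.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 30.0, 5.0],
})


def aggregate_child(main, child, key, column):
    """Aggregate one numeric child column per key and left-join it onto main."""
    feats = child.groupby(key)[column].agg(["count", "mean", "max"])
    # Prefix the aggregate names with the source column, e.g. amount_mean.
    feats.columns = [f"{column}_{name}" for name in feats.columns]
    return main.merge(feats.reset_index(), on=key, how="left")


features = aggregate_child(customers, orders, "customer_id", "amount")
print(features)
```

Each row of the main table gains one column per aggregate. Customers without orders receive missing values, which a downstream model or imputation step has to handle. Writing one such query by hand for every table, column, and aggregate is the repetitive work the abstract describes automating.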
