Refactoring ETL Flows in The Wild

Dolev Adas; Ohad Eytan; Josep Sampé; Paula Ta-Shma; Guy Kazma

Big Data 2023

Workshop paper

15 Dec 2023

Abstract

In modern data-driven ecosystems, Extract, Transform, Load (ETL) flows serve as the backbone of data integration pipelines. These flows facilitate the seamless movement of data across disparate systems and formats, streamlining processes that range from data acquisition to preparation for analysis. However, the pervasive use of ETL flows introduces a pressing challenge: how to bound the maintenance cost of an ever-expanding number of flows. In this paper, we describe an end-to-end prototype for ETL flow refactoring that aims to reduce maintenance cost while keeping the human in the loop for refactoring decisions. Our prototype adopts and significantly extends the gSpan Frequent Subgraph Mining (FSM) algorithm to apply it to real-world ETL use cases in the context of the IBM DataStage™ data integration tool. We report on real customer workloads, share their statistics, and evaluate the performance of our prototype. We found potential for up to 32% maintenance cost reduction on the use cases we analyzed, even after removing duplicate flows.

Index Terms: data flows, subflows, ETL, data integration, frequent subgraph mining.

Paper