With the enormous growth of data generated and collected by businesses every day, we are seeing a similar growth in the number and diversity of users (data engineers, data scientists, business analysts, etc.) interacting with that data to gain insights, make business decisions, and ultimately generate value. Their workloads (ad hoc queries, ETL, ELT, machine learning, artificial intelligence, generative AI, etc.) are ever-evolving, and their data is spread globally across private and public datacenters. This has driven the continuous development of new tools and data management systems to address those needs, leading to specialized monoliths that require rebuilding and reengineering efforts rather than reusing existing systems.
One approach is convergence, but convergence is unfortunately complex and costly. Instead, in flex.data we focus on decomposing data management systems into common components and leveraging common APIs and standards to improve interoperability, reusability, and cost/performance. This effectively decouples the language frontend from the execution engine and runtime by introducing an abstract intermediate representation of the computation, usually in the form of a query plan, together with a cross-engine optimizer that selects the right engine, execution plan, and deployment plan, optimizing for cost, performance, and compliance. Our objective is to evolve IBM’s data integration stack while building on key open source standards, such as Substrait and Apache Calcite.
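To make the idea of an engine-agnostic plan plus a cross-engine optimizer concrete, here is a minimal sketch in Python. All names (`PlanNode`, `pick_engine`, the per-engine cost constants) are illustrative assumptions for this sketch, not flex.data, Substrait, or Apache Calcite APIs; a real optimizer would use far richer cost models and statistics.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an engine-agnostic plan representation and a
# cost-based engine selector. Names and cost numbers are illustrative,
# not actual flex.data, Substrait, or Calcite APIs.

@dataclass
class PlanNode:
    op: str                              # e.g. "scan", "filter", "join"
    children: list = field(default_factory=list)

def plan_size(plan: PlanNode) -> int:
    """Count operators in the plan (a crude complexity proxy)."""
    return 1 + sum(plan_size(child) for child in plan.children)

# Assumed per-engine cost model: a fixed startup cost plus a per-operator cost.
ENGINES = {
    "single_node": {"startup": 0, "per_op": 10},
    "distributed": {"startup": 50, "per_op": 2},
}

def pick_engine(plan: PlanNode) -> str:
    """Return the engine with the lowest estimated cost for this plan."""
    n = plan_size(plan)
    return min(ENGINES, key=lambda e: ENGINES[e]["startup"] + n * ENGINES[e]["per_op"])

# A one-operator plan favors the low-startup single-node engine...
small = PlanNode("scan")
print(pick_engine(small))   # -> single_node (cost 10 vs 52)

# ...while a deep ten-operator plan favors the distributed engine.
big = PlanNode("scan")
for _ in range(9):
    big = PlanNode("filter", [big])
print(pick_engine(big))     # -> distributed (cost 70 vs 100)
```

Because every frontend lowers to the same plan representation, the selector can route the same logical query to different engines as data size, placement, or compliance constraints change, which is the decoupling described above.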
Some of our work is already available for customers to use in IBM products such as IBM DataStage, IBM ELT Pushdown Express, IBM Analytics Engine and watsonx.data.