CLOUD 2020
Conference paper

Analysis of SQL Workloads on an Enterprise Datalake



Over the last three years we have been running a large-scale data processing platform for applying analytics to corporate data on a private cloud instance. We control every level of the stack, from the processing engines down to the hardware. One very common usage pattern is for data scientists to use SQL/Hadoop to explore and analyze data sets. Data scientists are free to run whatever queries they want in this shared environment. Here we report on the usage patterns of data scientists and the measured performance of the queries they create. We explain why it is difficult to estimate the resource usage of a SQL query on such a system ahead of time and discuss the consequences for the design of enterprise datalakes.


18 Oct 2020

