Analysis of SQL Workloads on an Enterprise Datalake
Over the last three years we have been running a large-scale data processing platform for applying analytics to corporate data on a private cloud instance. We control every level of the stack, from the processing engines down to the hardware. One very common usage pattern is for data scientists to use SQL/Hadoop to explore and analyze data sets. Data scientists are free to run whatever queries they want in this shared environment. Here we report on the usage patterns of these data scientists and the measured performance of the queries they create. We explain why it is difficult to estimate the resource usage of a SQL query on such a system ahead of time and discuss the consequences for the design of enterprise datalakes.