Publication
CLOUD 2020
Conference paper
Analysis of SQL Workloads on an Enterprise Datalake
Abstract
Over the last three years, we have been running a large-scale data processing platform for applying analytics to corporate data on a private cloud instance. We control every level in the stack from the processing engines down to the hardware. One very common pattern of usage is for data scientists to use SQL/Hadoop to explore and analyze data sets. Data scientists are free to run whatever queries they want on this shared environment. Here we report on the usage patterns of data scientists and the measured performance of the queries they create. We motivate why it is difficult to estimate the resource usage of a SQL query on such a system ahead of time and explain the consequences for the design of enterprise datalakes.