Extensible Data Skipping

Paula Ta-Shma; Guy Khazma; Gal Lushi; Oshrit Feder

doi:10.1109/BigData50022.2020.9377740

Big Data 2020

Conference paper

10 Dec 2020

Abstract

Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products/services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, each requiring only around 30 lines of additional code. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries that are rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings with very low development cost.
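The core idea in the abstract can be illustrated with a minimal sketch. It assumes a centralized in-memory metadata store and two toy index types (min/max and value list). All names here (`build_minmax`, `may_contain_gt`, `select_objects`, the `part-*.parquet` object names) are hypothetical and chosen for illustration; this is not the framework's actual API, and it does not use Spark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical illustration only; not the API described in the paper.

@dataclass(frozen=True)
class MinMax:
    """Min/max metadata for one column of one data object."""
    lo: float
    hi: float

def build_minmax(values: List[float]) -> MinMax:
    """Collect min/max metadata by scanning an object once at index time."""
    return MinMax(min(values), max(values))

def may_contain_gt(meta: MinMax, threshold: float) -> bool:
    """An object can hold a row with value > threshold only if its max exceeds it."""
    return meta.hi > threshold

def build_value_list(values: List[float]) -> FrozenSet[float]:
    """A second, user-defined metadata type: the set of distinct values."""
    return frozenset(values)

def may_contain_eq(meta: FrozenSet[float], value: float) -> bool:
    """An object can hold a row equal to value only if value is in its value list."""
    return value in meta

def select_objects(metadata: Dict[str, object],
                   predicate: Callable[[object], bool]) -> List[str]:
    """Keep only objects whose metadata says they might contain matching rows."""
    return [name for name, meta in metadata.items() if predicate(meta)]

# Toy dataset: each "object" stands in for a file of column values.
objects = {
    "part-0.parquet": [1, 5, 9],
    "part-1.parquet": [20, 25, 31],
    "part-2.parquet": [8, 12, 40],
}

# Centralized metadata, consulted before any object is read.
minmax_store = {name: build_minmax(vals) for name, vals in objects.items()}
valuelist_store = {name: build_value_list(vals) for name, vals in objects.items()}

# Query: value > 15. part-0 (max 9) is skipped without being read.
relevant_gt = select_objects(minmax_store, lambda m: may_contain_gt(m, 15))

# Query: value = 12. Only part-2 lists 12 among its values.
relevant_eq = select_objects(valuelist_store, lambda m: may_contain_eq(m, 12))
```

The key property is that skipping must never produce false negatives: an object may be read unnecessarily, but an object containing matching rows must never be skipped. Adding a new index type in this sketch amounts to supplying one function that builds metadata and one that tests a predicate against it, which loosely mirrors the low per-index cost the abstract reports.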
