Publication
ICDE 2017
Conference paper

Too big to eat: Boosting analytics data ingestion from object stores with Scoop

Abstract

Extracting value from data stored in object stores, such as OpenStack Swift and Amazon S3, can be problematic in common scenarios where analytics frameworks and object stores run in physically disaggregated clusters. One of the main problems is that analytics frameworks must ingest large amounts of data from the object store before the actual computation, which incurs significant resource and performance overhead. To overcome this problem, we present Scoop. Scoop enables analytics frameworks to use the computational resources of object stores to optimize the execution of analytics jobs. It achieves this by allowing ETL-type actions to be added to the data upload path and by offloading querying functions to the object store through a rich and extensible active object storage layer. As a proof of concept, Scoop enables Apache Spark SQL selections and projections to be executed close to the data in OpenStack Swift, accelerating the analytics workloads of a smart energy grid company (GridPocket). Our experiments in a 63-machine cluster with real IoT data and SQL queries from GridPocket show that Scoop exhibits query execution times up to 30x faster than the traditional "ingest-then-compute" approach.
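The contrast between "ingest-then-compute" and executing selections and projections close to the data can be illustrated with a minimal, self-contained sketch. It is not Scoop's implementation and uses neither the Spark nor the Swift API: the in-memory CSV object, the column names, and the functions `ingest_then_compute`, `storage_side_filter`, and `pushdown` are all hypothetical, chosen only to show that both paths return the same rows while the pushdown path moves fewer bytes.

```python
import csv
import io

# Hypothetical meter readings standing in for one object in the object store.
OBJECT_DATA = (
    "meter_id,city,kwh\n"
    "m1,Nice,12.5\n"
    "m2,Paris,7.0\n"
    "m3,Nice,3.2\n"
)


def ingest_then_compute(raw, city):
    """Transfer the whole object, then filter and project on the analytics side."""
    transferred = len(raw.encode("utf-8"))  # every byte crosses the network
    rows = csv.DictReader(io.StringIO(raw))
    result = [(r["meter_id"], r["kwh"]) for r in rows if r["city"] == city]
    return result, transferred


def storage_side_filter(raw, columns, predicate):
    """Assumed storage-side function: apply the selection (predicate) and
    projection (columns) next to the data and return only the matching CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for r in csv.DictReader(io.StringIO(raw)):
        if predicate(r):
            writer.writerow([r[c] for c in columns])
    return out.getvalue()


def pushdown(raw, city):
    """Offload the selection and projection; transfer only the filtered rows."""
    filtered = storage_side_filter(
        raw, ["meter_id", "kwh"], lambda r: r["city"] == city
    )
    transferred = len(filtered.encode("utf-8"))  # only the result crosses the network
    result = [tuple(row) for row in csv.reader(io.StringIO(filtered))]
    return result, transferred


if __name__ == "__main__":
    full_rows, full_bytes = ingest_then_compute(OBJECT_DATA, "Nice")
    push_rows, push_bytes = pushdown(OBJECT_DATA, "Nice")
    print(full_rows == push_rows, full_bytes, push_bytes)
```

In this toy setting the saving comes purely from discarding non-matching rows and unused columns before transfer; the speedups reported in the abstract were measured on a real cluster and depend on query selectivity and deployment details that this sketch does not model.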