In the new era where IoT, Big Data, and analytics meet on the cloud, enormous amounts of data generated by IoT devices are streamed to and stored on the cloud for continuous analysis by analytic workloads. The throughput and volume of this data processing, combined with the scale and dynamic nature of cloud resources, create new challenges as well as new opportunities. This is where our work is focused.
We build the 'pipes' through which data is streamed from source to destination (e.g., from IoT frameworks to object storage, and from storage to analytics applications) faster and smarter, optimizing both streaming velocity and the data transformations that make the data analytics-ready.
We work to improve the performance of analytic workloads on the cloud (such as those using Apache Spark) so that they better exploit the scale and dynamic nature of the cloud and the data. Cloud operations at scale generate new types of problems that impact continuous availability and are difficult to debug. To help with this, we are developing a diagnostic framework that integrates root cause analysis with first-aid recovery operations, at scale. Another important aspect is adapting to the new economic models associated with cloud resources; here we are working on cost optimization and a smart cloud brokering service.
Our major activities include:
Spark File Filters
We developed File Filters, which extend the existing Spark SQL partitioning mechanism to dynamically filter out irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.). Unlike pushdown with formats such as Parquet, which still requires touching each object to determine its relevance, our approach never accesses irrelevant objects. This is essential for object storage, where every REST request incurs significant overhead and financial cost. We provide an easy-to-use pluggable interface for developing and deploying File Filters; for example, we implemented a filter that selects objects according to their metadata, as sketched below.
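To give a feel for the idea, here is a minimal, hypothetical sketch of a metadata-based filter. The trait name, method signature, and metadata keys are illustrative assumptions, not the actual plugin API; the real interface is part of our File Filters work.

    import org.apache.hadoop.fs.FileStatus

    // Hypothetical filter interface: decide per object whether it can
    // contribute rows to the current query, using only object metadata,
    // so irrelevant objects are never fetched from the object store.
    trait FileFilter {
      def accept(file: FileStatus, queryPredicates: Map[String, String]): Boolean
    }

    // Illustrative example: keep only objects whose user metadata records a
    // maximum temperature at or above the query's threshold. The `metadata`
    // function (object path -> metadata map) stands in for a metadata lookup.
    class MaxTempFilter(metadata: String => Map[String, String]) extends FileFilter {
      override def accept(file: FileStatus, queryPredicates: Map[String, String]): Boolean = {
        val threshold = queryPredicates.get("minTemperature").map(_.toDouble).getOrElse(Double.MinValue)
        metadata(file.getPath.toString).get("max-temperature").exists(_.toDouble >= threshold)
      }
    }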
Motivation
The Internet of Things is swamping our planet with data, while IoT use cases drive demand for analytics on extremely scalable and low-cost storage. Enter Spark SQL over Object Storage (e.g., Amazon S3, IBM Cloud Object Storage) – highly scalable and low-cost storage that provides RESTful APIs to store and retrieve objects and their metadata.
In our world of separately managed microservices, limiting the flow of data from object storage to Spark for any given query is critical to avoid shipping huge datasets. Existing techniques involve partitioning the data and using specialized formats such as Parquet, which support pushdown. Partitioning can be inflexible because of its static nature: only one partitioning hierarchy is possible, and it cannot be changed without rewriting the entire dataset, as illustrated below. File Filters for Spark changes all that.
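For context, this is what standard Spark SQL partitioning looks like today; the bucket names and column are placeholders, and the layout is the usual Hive-style directory scheme that is fixed at write time:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioning-example").getOrCreate()

    // Writing the data partitioned by "dt" fixes the hierarchy at write time
    // (objects land under .../dt=2017-01-01/...); partitioning by a different
    // column later means rewriting the entire dataset.
    val events = spark.read.json("s3a://my-bucket/raw-events")
    events.write.partitionBy("dt").parquet("s3a://my-bucket/events")

    // A filter on the partition column prunes whole prefixes, but a filter on
    // any other column still has to touch every object, even with Parquet
    // pushdown, since each object must be read to check its relevance.
    val recent = spark.read.parquet("s3a://my-bucket/events").where("dt = '2017-01-01'")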
A Bridge from Message Hub to Object Storage
Message Hub is IBM Bluemix’s hosted Kafka service, enabling application developers to leave the work of running and maintaining a Kafka cluster to messaging experts and focus on their application logic. We extended Message Hub with a bridge from Message Hub to the Bluemix Object Storage service, which persists a topic’s messages as analytics-ready objects.
With the Object Storage bridge, data pipelines from Message Hub to Object Storage can be easily set up and managed to generate Apache Spark conformant data, which can be analyzed directly by the IBM Data Science Experience using Spark as a Service.
In a typical IoT data pipeline, IoT device data is collected by the Watson IoT Platform, streamed to Object Storage via Message Hub and the Object Storage bridge, and then analyzed using Spark. A use case scenario demonstrating this kind of architecture and the use of the Object Storage bridge is described in Paula Ta-Shma's blog.
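As a sketch of the final analysis step, data the bridge has written to Object Storage can be read straight into Spark. The container, service name, and topic path below are placeholders, and we assume, as in the blog's scenario, that the messages are JSON records; the swift2d:// URI relies on the Stocator connector described further below.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bridge-analytics").getOrCreate()

    // Placeholder container/service/topic names; the bridge has already laid
    // the messages out as analytics-ready objects, so no extra ETL is needed.
    val readings = spark.read.json("swift2d://iot-archive.myservice/mytopic/")

    // Standard Spark SQL applies directly to the persisted topic data.
    readings.createOrReplaceTempView("readings")
    spark.sql("SELECT deviceId, avg(temperature) FROM readings GROUP BY deviceId").show()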
Stocator
Stocator (Apache License 2.0) is a unique object store connector for Apache Spark. Stocator is written in Java and implements an HDFS interface to simplify its integration with Spark. There is no need to modify Spark's code to use the new connector: it can be compiled together with Spark via a Maven build, or provided as an external package without recompiling Spark.
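A minimal configuration sketch of the second option is shown below. The service name, credentials, endpoints, and package version are placeholders, and the property names follow the Stocator documentation for its Swift (swift2d) driver; consult the Stocator README for the authoritative settings.

    import org.apache.spark.sql.SparkSession

    // Launched with Stocator on the classpath, e.g.
    //   spark-submit --packages com.ibm.stocator:stocator:1.0.8 ...
    val spark = SparkSession.builder().appName("stocator-example").getOrCreate()
    val hconf = spark.sparkContext.hadoopConfiguration

    // Register Stocator's HDFS-compatible implementation for swift2d:// URIs.
    hconf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")

    // Credentials for a service named "myservice" (placeholder values).
    hconf.set("fs.swift2d.service.myservice.auth.url", "https://identity.open.softlayer.com/v3/auth/tokens")
    hconf.set("fs.swift2d.service.myservice.auth.method", "keystoneV3")
    hconf.set("fs.swift2d.service.myservice.tenant", "PROJECT_ID")
    hconf.set("fs.swift2d.service.myservice.username", "USER_ID")
    hconf.set("fs.swift2d.service.myservice.password", "PASSWORD")
    hconf.set("fs.swift2d.service.myservice.region", "dallas")
    hconf.set("fs.swift2d.service.myservice.public", "true")

    // Read and write objects as if they were files.
    val df = spark.read.parquet("swift2d://mycontainer.myservice/events")
    df.write.parquet("swift2d://mycontainer.myservice/events-copy")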
Stocator works smoothly with the Hadoop MapReduce Client Core module, but takes an object store-native approach. Apache Spark continues to work with Hadoop MapReduce Client Core, and Spark can still use the various existing Hadoop output committers. Stocator intercepts the internal requests and adapts them to object store semantics before they reach the object store.
For example, when Stocator receives a request to create a temporary object or directory, it creates the final object instead of a temporary one. Stocator also uses streaming to upload objects. Uploading an object via streaming, without knowing its content length in advance, eliminates the need to create the object locally, calculate its length, and only then upload it to the object store. Swift is unique among the object store solutions available on the market in its support for such streaming uploads.
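Conceptually, the interception works along the lines of the hypothetical helper below (not Stocator's actual code): the temporary task-attempt path that Hadoop's committer would normally create and later rename is mapped directly to a final object name, with the attempt id embedded to distinguish speculative or retried tasks, so no temporary objects, renames, or copies ever reach the object store.

    // Illustrative only: map a Hadoop task-attempt path such as
    //   swift2d://c.s/data/_temporary/0/_temporary/attempt_201704_0001_m_000000_0/part-00000
    // to a final object name such as
    //   swift2d://c.s/data/part-00000-attempt_201704_0001_m_000000_0
    def finalObjectName(taskAttemptPath: String): String = {
      val parts = taskAttemptPath.split("/").toList
      val base = parts.takeWhile(_ != "_temporary").mkString("/")   // .../data
      val attemptId = parts.find(_.startsWith("attempt_")).getOrElse("")
      val fileName = parts.last                                     // part-00000
      s"$base/$fileName-$attemptId"
    }

    // Example:
    // finalObjectName("swift2d://c.s/data/_temporary/0/_temporary/attempt_201704_0001_m_000000_0/part-00000")
    //   => "swift2d://c.s/data/part-00000-attempt_201704_0001_m_000000_0"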
For more information on Stocator, visit https://github.com/SparkTC/stocator