Distributed Incremental Machine Learning for Big Time Series Data
Abstract
Today's highly instrumented systems generate large amounts of time series data from many different domains. In order to create meaningful insights from these data, techniques are needed to handle the collection, processing, and analysis at scale. The high frequency and volume of data that is generated introduces several challenges including data transformation, managing concept drift, the operational cost of model re-training and tracking, and scaling hyperparameter optimization.Incremental machine learning can provide a viable solution to handle these kinds of data. Further, distributed machine learning can be an efficient technique to improve performance, increase accuracy, and scale to larger input sizes.In this paper, we introduce a framework that combines the computational capabilities of Apache Spark and the workflow parallelization of Ray for distributed incremental learning. We conduct an empirical analysis of our framework for time series forecasting using the Walmart M5 dataset. The system can perform a parameter search on streaming data with concept drift producing a robust pipeline that fits high-volume data effectively. The results are encouraging and substantiate system proficiency over traditional big data analysis approaches that exclusively use either offline or online training.