Scalable Distributed Computing Systems for Incremental Machine Learning in Big Data Applications
- 2022
- INFORMS 2022
High-volume and/or high-velocity data is increasingly being generated and analyzed, especially in domains like IoT, banking, retail and Renewable Energy Sources (RES). Consequently, the number of applications that require accurate real-time forecasts and predictions is also steadily increasing. Examples include very short-term forecasting for energy bidding, failure detection in manufacturing, and sentiment analysis on data from sources like Twitter. These demands have been addressed using data compression techniques and incremental machine learning algorithms.
In collaboration with Aalborg University in Denmark and Athena Research Institute in Greece, we designed a novel end-to-end platform that provides efficient ingestion, compression, transfer, query processing, and machine learning-based analytics for high-frequency, high-volume time series from IoT. The performance of the platform was evaluated using real-world data from RES installations. The results showed the importance of high-frequency analytics and the surprisingly positive impact of error-bounded lossy compression on machine learning in the form of AutoML. For example, when detecting yaw misalignments in wind turbines, an improvement of 9% in accuracy was observed for AutoML models on lossy compressed data compared to the current industry standard of 10-minute aggregated data. These small-scale experiments show the potential of the platform, and larger pilots are planned.
Incremental machine learning algorithms are an efficient approach for this type of data, which exhibits frequent concept drift and is noisy and often incomplete. However, AutoML for incremental models can be a challenge. Unlike AutoML for batch learning algorithms, where the best model is chosen offline using a validation dataset and then deployed for scoring, AutoML for incremental models requires dynamic, constantly updated model selection to find the best model for a given concept. Similarly, dynamic ensembling, which is of great value for incremental models, requires that the base estimators' weights be regularly updated based on the motif of the incoming data. Dynamic AutoML workflows and ensembling for big data add a further layer of difficulty: the models need to be computationally efficient while remaining robust enough to handle hyper-parameter optimization across multiple variants of incremental learning, such as online, continual, shallow, and deep learning models, which vary widely in complexity.
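To make the idea of regularly updated ensemble weights concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method used in the platform or in SAIL: the two base forecasters, the exponentially discounted squared-error loss, and the `decay` and `sharpness` parameters are all hypothetical choices made for this example. Each model is scored on a new observation before learning from it (test-then-train), and the weights follow the recent loss.

```python
import math


class RunningMean:
    """Incremental forecaster: predicts the mean of everything seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def learn(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class LastValue:
    """Incremental forecaster: predicts the most recent observation."""

    def __init__(self):
        self.last = 0.0

    def predict(self):
        return self.last

    def learn(self, y):
        self.last = y


class DynamicEnsemble:
    """Weights base models by an exponentially discounted squared error.

    Hypothetical illustration only; `decay` and `sharpness` are assumed knobs.
    """

    def __init__(self, models, decay=0.9, sharpness=1.0):
        self.models = models
        self.decay = decay          # how quickly old errors are forgotten
        self.sharpness = sharpness  # how strongly low recent error is rewarded
        self.losses = [0.0] * len(models)

    def weights(self):
        scores = [math.exp(-self.sharpness * loss) for loss in self.losses]
        total = sum(scores)
        return [s / total for s in scores]

    def predict(self):
        return sum(w * m.predict() for w, m in zip(self.weights(), self.models))

    def learn(self, y):
        # Test-then-train: score each model on y before it is updated with y.
        for i, model in enumerate(self.models):
            err = (model.predict() - y) ** 2
            self.losses[i] = self.decay * self.losses[i] + (1 - self.decay) * err
            model.learn(y)


# Synthetic stream with an abrupt concept drift: level 1.0, then level 5.0.
stream = [1.0] * 50 + [5.0] * 50
ensemble = DynamicEnsemble([RunningMean(), LastValue()])
for y in stream:
    ensemble.learn(y)

weights_after_drift = ensemble.weights()
print(weights_after_drift)
```

On this synthetic stream the running mean adapts slowly after the drift, so most of the weight moves to the last-value forecaster. Always picking the single model with the lowest recent loss would turn the same bookkeeping into a simple form of dynamic model selection.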
To address some of these issues, we have developed the Streams and Incremental Learning (SAIL) library (open source, MIT license), which provides a common scikit-learn-based interface for several of the incremental learning variants mentioned above. Unlike existing libraries, which keep batch and incremental models independent, it also provides options for ensembling with batch learning models. Further, distributed computing for AutoML and ensembling within SAIL has been set up using Ray.
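The sketch below shows one way a shared interface can let a batch-trained model and an incremental model sit in the same ensemble. It does not reproduce SAIL's actual classes or signatures, which are not shown here. Every class name is hypothetical, and only the `partial_fit`/`predict` method names are borrowed from scikit-learn's convention (with scalar inputs instead of arrays, to keep the example dependency-free).

```python
from typing import Protocol, Sequence


class Estimator(Protocol):
    """Assumed shared interface, using scikit-learn's method names."""

    def partial_fit(self, x: float, y: float) -> None: ...

    def predict(self, x: float) -> float: ...


class FrozenBatchModel:
    """Stand-in for a model trained offline: y = slope * x, never updated."""

    def __init__(self, slope: float):
        self.slope = slope

    def partial_fit(self, x: float, y: float) -> None:
        pass  # a batch model ignores streaming updates

    def predict(self, x: float) -> float:
        return self.slope * x


class OnlineSlopeModel:
    """Incremental model: fits y = slope * x by stochastic gradient descent."""

    def __init__(self, lr: float = 0.05):
        self.slope = 0.0
        self.lr = lr

    def partial_fit(self, x: float, y: float) -> None:
        self.slope += self.lr * (y - self.slope * x) * x

    def predict(self, x: float) -> float:
        return self.slope * x


class AveragingEnsemble:
    """Averages members that share the interface, batch or incremental."""

    def __init__(self, members: Sequence[Estimator]):
        self.members = list(members)

    def partial_fit(self, x: float, y: float) -> None:
        for member in self.members:
            member.partial_fit(x, y)

    def predict(self, x: float) -> float:
        return sum(m.predict(x) for m in self.members) / len(self.members)


batch_model = FrozenBatchModel(slope=2.0)   # pretend this was fitted offline
online_model = OnlineSlopeModel()
ensemble = AveragingEnsemble([batch_model, online_model])

# Stream generated by y = 3 * x; only the online member adapts to it.
for step in range(200):
    x = 1.0 if step % 2 == 0 else 2.0
    ensemble.partial_fit(x, 3.0 * x)

print(online_model.slope, ensemble.predict(1.0))
```

Because the frozen model accepts `partial_fit` calls as a no-op, the ensemble can drive every member in the same way without needing to know which ones actually learn from the stream.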
The work can be exploited in domains where high-frequency, high-volume data is becoming more prominent. In MORE, we use these models on data provided by Engie and Inaccess for use cases based in wind and solar parks, respectively.