Work-in-Progress: Maximizing Model Accuracy in Real-time and Iterative Machine Learning
As iterative machine learning (ML), such as neural-network-based supervised learning and k-means clustering, becomes ubiquitous in daily life, it is increasingly important to complete model training quickly enough to support real-time decision making while still achieving high model accuracy (e.g., low prediction error), which is critical to the profits of ML tasks. Motivated by the observation that, in many iterative ML applications, a small proportion of accuracy-critical input data can account for a large part of model accuracy, this paper introduces a system middleware that maximizes model accuracy by spending the limited time budget on the most accuracy-related input data. To achieve this, our approach employs a fast method to divide the input data into multiple parts of similar points and represents each part with an aggregated data point. Using these aggregated points, it quickly estimates the correlation between each part and model accuracy, thus allowing ML tasks to process the most accuracy-related parts first. We integrate our approach with two popular ML algorithms on Spark, covering supervised and unsupervised learning, and demonstrate its benefits in providing high model accuracy under short deadlines.
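The pipeline described above (group similar points, aggregate each group, score the groups, process the highest-scoring ones first) can be illustrated with a minimal single-machine sketch in plain Python. It is not the middleware itself, and several pieces are assumptions made only for illustration: the grid bucketing in partition_by_grid stands in for the unspecified fast grouping method, the score in score_part (part size times squared distance from the aggregated point to its nearest current k-means centroid) is a hypothetical proxy for accuracy relevance, and the point budget in select_within_budget stands in for the time budget. All function names are hypothetical.

```python
from collections import defaultdict
import math


def partition_by_grid(points, cell_size):
    """Group points whose coordinates fall in the same grid cell.

    Assumption: a uniform grid is used here only as a simple stand-in
    for the fast grouping method mentioned in the abstract.
    """
    parts = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        parts[key].append(p)
    return list(parts.values())


def aggregate(part):
    """Represent a part by the coordinate-wise mean of its points."""
    n = len(part)
    return tuple(sum(coords) / n for coords in zip(*part))


def score_part(part, centroids):
    """Hypothetical relevance score for k-means.

    Part size times the squared distance from the aggregated point to
    its nearest current centroid, i.e. a rough estimate of how much the
    part contributes to the clustering cost.
    """
    rep = aggregate(part)
    nearest = min(sum((a - b) ** 2 for a, b in zip(rep, c)) for c in centroids)
    return len(part) * nearest


def rank_parts(points, centroids, cell_size=1.0):
    """Return the parts ordered from highest to lowest score."""
    parts = partition_by_grid(points, cell_size)
    return sorted(parts, key=lambda part: score_part(part, centroids), reverse=True)


def select_within_budget(ranked_parts, max_points):
    """Take whole parts in ranked order until the point budget is used up.

    Assumption: a cap on the number of points stands in for the time budget.
    """
    chosen, used = [], 0
    for part in ranked_parts:
        if used + len(part) > max_points:
            break
        chosen.append(part)
        used += len(part)
    return chosen
```

A call such as select_within_budget(rank_parts(points, centroids), max_points) returns the parts that one training iteration would process under the budget. Only one aggregated point per part is scored, so the ranking cost depends on the number of parts rather than the number of input points.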