BigData Congress 2018
Conference paper

Dynamic Model Evaluation to Accelerate Distributed Machine Learning

The increase in the volume and variety of data has increased data scientists' reliance on shared computational resources, either in-house or obtained from cloud providers, to execute machine learning and artificial intelligence programs. This, in turn, has created the challenge of exploiting the available resources to execute such 'cognitive workloads' quickly and effectively to gather the needed knowledge and data insight. A common challenge in machine learning is knowing when to stop model building. This is often exacerbated in the presence of big data, as a trade-off arises between the cost of producing the model (time, volume of training data, resources utilised) and its general performance. Whilst there are many tools and application stacks available to train models over distributed resources, the challenge of knowing when a model is 'good enough', or no longer worth pursuing, persists. In this paper, we propose a framework for evaluating the models produced by distributed machine learning algorithms during the training process. This framework integrates with the cluster job scheduler so as to finalise model training under constraints of resource availability or time, or simply because model performance is asymptotic with further training. We present a prototype implementation of this framework using Apache Spark and YARN, and demonstrate the benefits of this approach using sample applications with both supervised and unsupervised learning algorithms.
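The stopping logic described above — finalising training when a time or resource budget is exhausted, or when further training yields only asymptotic improvement — can be sketched as a simple decision function. The function below is an illustrative assumption, not the paper's actual implementation; the parameter names (`min_delta`, `patience`, `time_budget`) are hypothetical, and a real deployment would wire such a check into the Spark/YARN scheduler rather than call it directly.

```python
import time


def should_stop(score_history, min_delta=1e-3, patience=3,
                start_time=None, time_budget=None):
    """Decide whether to finalise model training.

    Returns True when the wall-clock budget is exhausted, or when each of
    the last `patience` evaluation rounds improved the model score by less
    than `min_delta` (i.e. performance has become asymptotic).
    """
    # Resource/time constraint: stop immediately if the budget is spent.
    if start_time is not None and time_budget is not None:
        if time.time() - start_time >= time_budget:
            return True
    # Not enough evaluation history yet to judge convergence.
    if len(score_history) <= patience:
        return False
    # Asymptotic check: every recent improvement is below the threshold.
    recent = score_history[-(patience + 1):]
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(d < min_delta for d in deltas)
```

In a periodic-evaluation loop, the framework would append each intermediate model's validation score to `score_history` and stop training as soon as `should_stop` returns True, releasing the cluster resources for other jobs.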