Publication
SBAC-PAD 2019
Conference paper
AI Gauge: Runtime estimation for deep learning in the cloud
Abstract
Major cloud providers, including IBM Cloud, Amazon Web Services, Microsoft Azure, and Google Cloud, offer services to train, debug, store, and deploy machine learning models at scale. Estimating the runtime of training a machine learning model is important for an enhanced user experience in SLA-driven control, cost-effective budgeting, elastic scaling, and efficient operations. We present AI Gauge, a cloud service that estimates the runtime and cost of training deep learning models under different configuration options on the cloud. AI Gauge is designed using a microservice architecture and bases its estimates on machine learning models calibrated with an extensive, continuously populated dataset of job traces. We show that AI Gauge can accurately predict the remaining time of a running job from its runtime progress (< 10% relative error) and can accurately predict the total runtime of a job before it starts, with 7-8% relative error on average.
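To make the reported accuracy figures concrete, the following minimal Python sketch shows how a relative error of this kind is typically computed, alongside a naive linear-extrapolation baseline for remaining time. This is not AI Gauge's estimator: the abstract states only that its predictions come from machine learning models calibrated on job traces. The function names, the step-based notion of progress, and the sample numbers are all hypothetical and chosen for illustration.

```python
def relative_error(predicted: float, actual: float) -> float:
    """Absolute relative error of a prediction, as a fraction of the actual value."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return abs(predicted - actual) / actual


def naive_remaining_time(elapsed_s: float, steps_done: int, total_steps: int) -> float:
    """Hypothetical baseline (NOT AI Gauge's method): assume every remaining
    training step takes the average time per step observed so far."""
    if not 0 < steps_done <= total_steps:
        raise ValueError("steps_done must be in the range 1..total_steps")
    avg_step_s = elapsed_s / steps_done
    return avg_step_s * (total_steps - steps_done)


# Illustrative numbers only: 200 of 1000 steps finished after 600 seconds.
estimate_s = naive_remaining_time(elapsed_s=600.0, steps_done=200, total_steps=1000)

# Suppose the job actually needed 2500 more seconds (made-up value).
error = relative_error(predicted=estimate_s, actual=2500.0)
print(f"estimated remaining: {estimate_s:.0f} s, relative error: {error:.1%}")
```

A linear baseline like this is a common point of comparison for learned estimators, since it ignores effects such as warm-up, data loading stalls, or changing step times that a model trained on job traces could capture.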