Publication
CLOUD 2023
Conference paper

xCloudServing: Automated and Optimized ML Serving across Clouds

Abstract

As machine learning (ML) models have grown in complexity, so too have the expenses they incur when deployed in the cloud. To reduce the cost of ML serving, it is necessary to optimize the choice of cloud infrastructure. The chosen infrastructure must also meet the latency constraints that are typically defined for cloud services. The problem is made harder because today’s organizations often need to work with more than one cloud provider, and each provider offers its own set of interfaces and infrastructure choices. In this work we present xCloudServing, a novel system for consistent and automated deployment of ML inference services across multiple cloud providers and regions. We describe the architecture and implementation of xCloudServing, as well as the optimization algorithms it implements internally. These include established methods from the literature, as well as Niebo, our novel algorithm for minimizing cost while satisfying the tail latency constraint. We present simulation results for five ML models across three cloud providers and multiple tail latency constraints, which indicate that, on average, Niebo outperforms state-of-the-art algorithms by 37%. Additionally, we evaluate xCloudServing with live runs and demonstrate that it is robust to nondeterministic effects and exhibits reproducible behavior.
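To make the optimization problem in the abstract concrete, the sketch below picks the cheapest deployment option whose tail latency stays within a given bound. It is a plain exhaustive baseline and is not Niebo or any algorithm from the paper. The `Candidate` fields, the provider and instance names, the prices, and the choice of the 99th percentile are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical deployment option; every field is illustrative."""
    provider: str
    instance_type: str
    cost_per_hour: float        # assumed price in USD
    latencies_ms: List[float]   # assumed measured request latencies


def tail_latency(samples: List[float], quantile: float = 0.99) -> float:
    """Return the nearest-rank percentile of the latency samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(math.ceil(quantile * len(ordered)), 1)  # 1-based rank
    return ordered[rank - 1]


def cheapest_feasible(candidates: List[Candidate],
                      latency_bound_ms: float,
                      quantile: float = 0.99) -> Optional[Candidate]:
    """Exhaustive baseline: cheapest candidate meeting the tail latency bound."""
    feasible = [c for c in candidates
                if tail_latency(c.latencies_ms, quantile) <= latency_bound_ms]
    # None signals that no candidate satisfies the constraint.
    return min(feasible, key=lambda c: c.cost_per_hour, default=None)


# Made-up candidates across three hypothetical providers.
candidates = [
    Candidate("cloud-a", "small-cpu", 0.10, [300.0] * 100),
    Candidate("cloud-b", "medium-cpu", 0.25, [80.0] * 95 + [120.0] * 5),
    Candidate("cloud-c", "gpu", 0.90, [40.0] * 100),
]

best = cheapest_feasible(candidates, latency_bound_ms=150.0)
print(best.provider if best else "no feasible option")
```

Exhaustive search like this needs latency measurements for every option up front, which becomes expensive as the number of providers, regions, and instance types grows; that measurement cost is the usual motivation for smarter search strategies.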