Publication
SoCC 2024
Conference paper

Queue Management for Large Language Model Serving

Abstract

The emergence of large language models (LLMs) has introduced substantial computational demands and unique execution patterns (i.e., nondeterministic execution times due to autoregressive generation) for cloud providers. Consequently, existing LLM serving systems suffer from long request queues and fail to enforce request-serving service-level objectives (SLOs), because no effective way yet exists to translate high-level SLOs into low-level LLM serving operations (LSOs), such as request eviction and GPU-CPU state swapping. We introduce QLM, the first queue management system for multi-model LLM serving that maximizes SLO enforcement while achieving high throughput and utilization on heterogeneous devices. QLM (1) handles the nondeterminism of incoming requests in the waiting queue with a highly explainable Bayesian statistical approach, and (2) reorders and assigns requests to devices (model instances) with a stochastic programming solver. The execution order of the request queue automatically translates into the LSOs enabled by the downstream LLM serving system. QLM supports five basic LSOs: request pulling, request eviction, GPU-CPU state swapping, model warm start, and autoscaling, and can be extended to support additional ones. Evaluation of QLM on heterogeneous devices and model types shows a 100-1000x reduction in queuing time for high-priority requests while improving throughput by up to 20%, resulting in 50-75% higher SLO enforcement than state-of-the-art model serving systems. QLM is being moved to production by a major cloud provider.
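
To make the queue-to-LSO translation concrete, the minimal Python sketch below walks an already-reordered waiting queue and emits request-pulling and request-eviction actions against a single model instance. All names, data structures, and the token-budget heuristic are illustrative assumptions for exposition only, not QLM's actual implementation, which additionally covers GPU-CPU state swapping, model warm start, and autoscaling.

# Toy sketch: translating a reordered waiting queue into low-level serving
# operations (LSOs) such as request pulling and eviction. Every name, field,
# and threshold here is an illustrative assumption, not QLM's implementation.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    priority: int      # lower value = higher priority (assumption)
    est_tokens: int    # estimated output length, e.g. from a statistical model


@dataclass
class ModelInstance:
    instance_id: str
    capacity_tokens: int                                 # simplistic batch budget
    running: List[Request] = field(default_factory=list)


def translate_queue_to_lsos(queue: List[Request],
                            instance: ModelInstance) -> List[Tuple[str, str, str]]:
    """Walk an already-reordered queue and emit (LSO, request, instance) actions."""
    actions = []
    for req in sorted(queue, key=lambda r: r.priority):
        used = sum(r.est_tokens for r in instance.running)
        if used + req.est_tokens <= instance.capacity_tokens:
            instance.running.append(req)
            actions.append(("PULL", req.request_id, instance.instance_id))
            continue
        # No room: evict the lowest-priority running request, but only if the
        # waiting request has strictly higher priority than the victim.
        victim = max(instance.running, key=lambda r: r.priority, default=None)
        if victim is not None and victim.priority > req.priority:
            instance.running.remove(victim)
            actions.append(("EVICT", victim.request_id, instance.instance_id))
            if sum(r.est_tokens for r in instance.running) + req.est_tokens \
                    <= instance.capacity_tokens:
                instance.running.append(req)
                actions.append(("PULL", req.request_id, instance.instance_id))
    return actions


if __name__ == "__main__":
    gpu = ModelInstance("gpu-0", capacity_tokens=1000,
                        running=[Request("r-low", priority=5, est_tokens=800)])
    waiting = [Request("r-high", priority=1, est_tokens=600),
               Request("r-mid", priority=3, est_tokens=300)]
    for action in translate_queue_to_lsos(waiting, gpu):
        print(action)   # ('EVICT', 'r-low', 'gpu-0'), ('PULL', 'r-high', 'gpu-0'), ...

In this toy run, the low-priority running request is evicted so that the high-priority waiting request can be pulled first; the paper's system makes the analogous decision globally across model instances via its stochastic programming solver rather than this greedy per-instance heuristic.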