EuroSys 2024
Workshop paper

Towards Pareto Optimal Throughput in Small Language Model Serving


Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks and show impressive zero-shot and few-shot capabilities across a wide range of applications. Although deploying language models is compute- and memory-intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with state-of-the-art performance. It also introduces a unique scenario in which a single accelerator can satisfy the memory requirements of large batches. Increasing the batch size has previously been associated with compute-bound scenarios, but this intuition lacks experimental support, primarily because the focus has been on LLMs, where large batch sizes are rarely reached. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis offers a new perspective on serving and opens new doors in multi-model scheduling. Additionally, we provide a first set of results on how model replication can effectively improve resource utilization.