Sampling Where It Matters: Predicting LLM Serving Performance

Emile Aydar; Christian Pinto; Srikumar Venugopal; Dimitris Chatzopoulos

doi:10.1145/3805621.3807633

EuroMLSys 2026

Workshop paper

27 Apr 2026

Abstract

Characterizing Large Language Model (LLM) serving performance is a combinatorial problem where a suboptimal choice wastes profiling budget: every change in model, hardware, or software version requires fresh profiling, yet exhaustive benchmarking is infeasible. Existing approaches, namely simulators and static performance estimators, lose fidelity on novel architectures or target only optima. We introduce Predictive Kernel Herding (PKH), a sampler that reformulates Random Forest leaf co-occurrence as linear-time histogram matching, replacing $O(N^2)$ pairwise kernel comparisons. On four real-world LLM serving traces spanning 3,000+ configurations, PKH is the only sampler that delivers top-ranked accuracy on both throughput and latency predictions, dominating the cost–accuracy Pareto frontier. PKH predicts output throughput within 10% MAPE and mean Time to First Token (TTFT) within 20% MAPE, reaching practically useful accuracy with up to 1.6× lower profiling time than the next-best method at equivalent error.
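The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how leaf co-occurrence can be scored with histograms rather than pairwise comparisons; it is not the authors' implementation. It assumes the kernel between two configurations is the fraction of trees in which they land in the same leaf, and that selection follows the standard greedy kernel herding rule. The names `mean_similarity` and `herd` are hypothetical, and `leaves[i][t]` stands for the leaf id of candidate `i` in tree `t` (the kind of matrix a fitted forest's `apply` method returns in scikit-learn).

```python
from collections import Counter


def kernel(a, b):
    """Leaf co-occurrence kernel: fraction of trees where a and b share a leaf."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_similarity(leaves):
    """Mean kernel value of each candidate against the whole pool.

    Uses one leaf histogram per tree, so the cost is O(N * T)
    instead of the O(N^2 * T) of comparing every pair.
    """
    n, n_trees = len(leaves), len(leaves[0])
    hist = [Counter(row[t] for row in leaves) for t in range(n_trees)]
    return [
        sum(hist[t][row[t]] for t in range(n_trees)) / (n * n_trees)
        for row in leaves
    ]


def herd(leaves, budget):
    """Greedy kernel herding; returns indices of the selected candidates."""
    n_trees = len(leaves[0])
    target = mean_similarity(leaves)
    # Histograms of the leaves occupied by already-selected candidates.
    picked_hist = [Counter() for _ in range(n_trees)]
    picked = []
    for step in range(budget):
        best, best_score = None, float("-inf")
        for i, row in enumerate(leaves):
            if i in picked:
                continue
            # Sum of kernel values to the selected set, via histogram lookup.
            overlap = sum(picked_hist[t][row[t]] for t in range(n_trees)) / n_trees
            score = target[i] - overlap / (step + 1)
            if score > best_score:
                best, best_score = i, score
        picked.append(best)
        for t in range(n_trees):
            picked_hist[t][leaves[best][t]] += 1
    return picked
```

For example, with `leaves = [[0, 0], [0, 0], [0, 1], [1, 1], [1, 2]]`, calling `herd(leaves, 2)` first picks a candidate from the densest leaves and then one that shares no leaf with it. The histogram lookups give the same scores as averaging `kernel` over all pairs; only the cost differs.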