Workshop paper

Sampling Where It Matters: Predicting LLM Serving Performance

Abstract

Characterizing Large Language Model (LLM) serving performance is a combinatorial problem in which every suboptimal choice wastes profiling budget: each change in model, hardware, or software version requires fresh profiling, yet exhaustive benchmarking is infeasible. Existing approaches (simulators and static performance estimators) lose fidelity on novel architectures or target only optima. We introduce \textbf{Predictive Kernel Herding (PKH)}, a sampler that reformulates Random Forest leaf co-occurrence as linear-time histogram matching, replacing $O(N^2)$ kernel comparisons. On four real-world LLM serving traces spanning 3,000+ configurations, PKH is the only sampler that delivers top-ranked accuracy on both throughput and latency predictions, dominating the cost–accuracy Pareto frontier. PKH predicts output throughput within 10% MAPE and mean Time to First Token (TTFT) within 20% MAPE, and it reaches practically useful accuracy with up to 1.6× lower profiling time than the next-best method at equivalent error.
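As an illustration of why leaf co-occurrence admits a linear-time formulation, the sketch below assumes the standard Random Forest proximity kernel, in which the similarity of two configurations is the fraction of trees that place them in the same leaf. Under that assumption, the average similarity of one configuration to all N others can be read from per-tree leaf histograms instead of being accumulated over all pairs. This is not PKH itself: the function names and the `LEAVES` array are hypothetical, and the abstract does not specify the sampler's actual selection rule.

```python
from collections import Counter

def mean_similarity_pairwise(leaves):
    """Baseline: average leaf co-occurrence of each point with every point.

    leaves[i][k] is the leaf index of configuration i in tree k.
    Cost is O(N^2 * T) for N configurations and T trees.
    """
    n, t = len(leaves), len(leaves[0])
    out = []
    for a in leaves:
        # Count, over all points b and all trees, how often a and b share a leaf.
        total = sum(sum(la == lb for la, lb in zip(a, b)) for b in leaves)
        out.append(total / (n * t))
    return out

def mean_similarity_histogram(leaves):
    """Same quantity from per-tree leaf histograms, in O(N * T)."""
    n, t = len(leaves), len(leaves[0])
    # One histogram per tree: how many configurations fall in each leaf.
    hists = [Counter(row[k] for row in leaves) for k in range(t)]
    # A point's summed co-occurrence in tree k is the size of its own leaf.
    return [sum(hists[k][row[k]] for k in range(t)) / (n * t) for row in leaves]

# Hypothetical leaf assignments: 5 configurations, 3 trees.
LEAVES = [
    [0, 2, 1],
    [0, 2, 0],
    [1, 2, 1],
    [1, 0, 1],
    [0, 1, 0],
]
```

Both functions return the same values on `LEAVES`; only the histogram version avoids the pairwise loop, which is the kind of saving the linear-time claim refers to.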