Pavel Klavík, A. Cristiano I. Malossi, et al.
Philos. Trans. R. Soc. A
Characterizing Large Language Model (LLM) serving performance is a combinatorial problem in which a suboptimal choice of configurations to profile wastes the profiling budget: every change in model, hardware, or software version requires fresh profiling, yet exhaustive benchmarking is infeasible. Existing approaches, namely simulators and static performance estimators, lose fidelity on novel architectures or target only optima. We introduce Predictive Kernel Herding (PKH), a sampler that reformulates Random Forest leaf co-occurrence as linear-time histogram matching, replacing kernel comparisons. On four real-world LLM serving traces spanning 3,000+ configurations, PKH is the only sampler that delivers top-ranked accuracy on both throughput and latency predictions, dominating the cost–accuracy Pareto frontier. PKH predicts output throughput within 10% mean absolute percentage error (MAPE) and mean Time to First Token (TTFT) within 20% MAPE, reaching practically useful accuracy with up to 1.6× lower profiling time than the next-best method at equivalent error.
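The abstract does not spell out the PKH procedure, so the following is only a minimal sketch of one way herding over a Random Forest leaf co-occurrence kernel can be reduced to histogram lookups. It assumes the kernel between two configurations is the fraction of trees that place them in the same leaf, and that leaf ids per tree are already available (for example from a fitted forest). The function name `herd_by_leaf_histograms`, its arguments, and the greedy scoring rule are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter


def herd_by_leaf_histograms(leaves, budget):
    """Greedy herding under an assumed leaf co-occurrence kernel.

    leaves[i][t] is the leaf id that tree t assigns to candidate i.
    Returns the indices of up to `budget` selected candidates.
    """
    n, n_trees = len(leaves), len(leaves[0])
    # Per-tree histogram of leaf occupancy over the whole candidate pool.
    pool = [Counter(row[t] for row in leaves) for t in range(n_trees)]
    # Mean kernel value between each candidate and the pool, read off
    # the histograms instead of comparing every pair of candidates.
    target = [
        sum(pool[t][row[t]] for t in range(n_trees)) / (n * n_trees)
        for row in leaves
    ]
    # Per-tree histogram of the leaves occupied by selected candidates.
    picked = [Counter() for _ in range(n_trees)]
    selected, chosen = [], set()
    for step in range(min(budget, n)):
        best, best_score = None, float("-inf")
        for i, row in enumerate(leaves):
            if i in chosen:
                continue
            # Kernel sum between candidate i and the selected set.
            overlap = sum(picked[t][row[t]] for t in range(n_trees)) / n_trees
            # Herding score: match the pool, avoid already-covered leaves.
            score = target[i] - overlap / (step + 1)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        chosen.add(best)
        for t in range(n_trees):
            picked[t][leaves[best][t]] += 1
    return selected


# Toy pool: three candidates share one leaf in both trees, two share another.
toy_leaves = [[0, 0], [0, 0], [0, 0], [1, 1], [1, 1]]
first_two = herd_by_leaf_histograms(toy_leaves, 2)
```

In this sketch each selection step costs time proportional to the number of candidates times the number of trees, because both terms of the score come from histogram lookups rather than from a pairwise kernel matrix. On the toy pool, the first pick comes from the larger leaf group and the second from the smaller one, so the sample covers both regions.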
Erik Altman, Jovan Blanusa, et al.
NeurIPS 2023
Conrad Albrecht, Jannik Schneider, et al.
CVPR 2025
Miao Guo, Yong Tao Pei, et al.
WCITS 2011