
JuStRank: Benchmarking LLM Judges for System Ranking

Abstract

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
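
To make the evaluation setup concrete, the following is a minimal sketch (not the paper's implementation) of the procedure the abstract describes: per-instance judge scores are aggregated into system-level scores, and the induced system ranking is compared to a human-based ranking via a rank correlation. All data values, system names, and the choice of Kendall's tau are illustrative assumptions.

```python
# Minimal illustrative sketch: aggregate judge scores per system,
# then measure agreement with a human-based ranking.
from collections import defaultdict
from scipy.stats import kendalltau

# judge_scores[(system, instance)] -> judge's score for that system's output
# (hypothetical values for illustration)
judge_scores = {
    ("model_a", 0): 0.9, ("model_a", 1): 0.7,
    ("model_b", 0): 0.6, ("model_b", 1): 0.8,
    ("model_c", 0): 0.4, ("model_c", 1): 0.5,
}

# Aggregate: mean judgment score per system
totals, counts = defaultdict(float), defaultdict(int)
for (system, _), score in judge_scores.items():
    totals[system] += score
    counts[system] += 1
system_scores = {s: totals[s] / counts[s] for s in totals}

# Human-based system scores (e.g., from a human-preference leaderboard);
# hypothetical values
human_scores = {"model_a": 0.85, "model_b": 0.65, "model_c": 0.55}

# Judge quality as ranking agreement between judge-based and human-based scores
systems = sorted(system_scores)
tau, _ = kendalltau(
    [system_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Ranking agreement (Kendall's tau): {tau:.2f}")
```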

Related Work