Publication
AAAI 2025
Demo paper
EvalAssist: LLM-as-a-judge simplified
Abstract
We demonstrate EvalAssist, a framework that facilitates the use of large language models as evaluators (LLM-as-a-Judge) by supporting users in iteratively refining evaluation criteria in a web-based user experience and applying them to large amounts of data through a Python toolkit. Our framework introduces a new LLM-as-a-judge evaluation approach that addresses issues of evaluation robustness and allows us to compute scores such as positional bias and uncertainty to engender trust in the judged content. We have run extensive benchmarks and also deployed the system internally in our organization with several hundred users.
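The positional-bias score mentioned in the abstract can be illustrated with a standard order-swap check for pairwise LLM-as-a-judge evaluation. The sketch below is a hypothetical illustration of that general idea only, not the EvalAssist toolkit's actual API; the `Judge` callable and `positional_bias_rate` helper are invented names for this example.

```python
# Hypothetical sketch (not the EvalAssist API): estimating positional bias
# for a pairwise LLM-as-a-judge by swapping the order of the two responses
# and measuring how often the verdict flips.

from typing import Callable, List, Tuple

# A judge takes (criterion, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

def positional_bias_rate(judge: Judge, criterion: str,
                         pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict changes when response order is swapped."""
    flips = 0
    for a, b in pairs:
        forward = judge(criterion, a, b)    # verdict with original order
        backward = judge(criterion, b, a)   # verdict with swapped order
        # A position-consistent judge picks the same underlying response both
        # times: "A" in the forward pass corresponds to "B" in the backward pass.
        consistent = (forward == "A" and backward == "B") or \
                     (forward == "B" and backward == "A")
        flips += 0 if consistent else 1
    return flips / len(pairs) if pairs else 0.0
```

A higher rate indicates the judge's verdicts depend on presentation order rather than content, which is one of the robustness signals the abstract refers to.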