Gang Wang, Fei Wang, et al.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
This paper investigates the challenges and opportunities in evaluating outputs generated by large language models (LLMs) at scale. With LLMs increasingly integrated into applications ranging from customer service to content creation, organizations face significant hurdles in assessing qualitative aspects such as accuracy, coherence, bias, and compliance with brand guidelines. Through a comprehensive literature review and a comparative analysis of existing evaluation tools, spanning both graphical interfaces and code-driven systems, this research identifies critical challenges in scalability, multi-criteria support, aggregation of results, and transparency. Complementing the literature review, contextual inquiries with professionals from diverse technical backgrounds provided insight into user preferences and practical difficulties in evaluating large datasets of LLM outputs. Based on these findings, we propose design recommendations for next-generation LLM evaluation tools, emphasizing advanced filtering and drill-down capabilities, multi-level aggregated insights that combine quantitative and qualitative analyses, iterative refinement of evaluation criteria to adapt to evolving requirements, and interactive visualizations that elucidate the underlying scoring process. These recommendations aim to enhance the reliability and trustworthiness of evaluation systems, ultimately supporting more efficient and nuanced assessment of LLM performance across varied real-world applications.
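To make the recommended multi-criteria workflow concrete, the following is a minimal sketch, assuming simple placeholder heuristics; the criterion functions, weights, and the `evaluate`/`drill_down` helpers are illustrative names introduced here, not components of the tool described in the paper.

```python
# Illustrative multi-criteria evaluation: each criterion maps an LLM output
# string to a score in [0, 1]. The heuristics and weights are placeholders,
# not the scoring methods studied in the paper.
CRITERIA = {
    "accuracy": lambda text: 0.0 if "unverified" in text.lower() else 1.0,
    "coherence": lambda text: min(len(text.split()) / 50.0, 1.0),
    "brand_compliance": lambda text: 0.0 if "rival corp" in text.lower() else 1.0,
}
WEIGHTS = {"accuracy": 0.5, "coherence": 0.3, "brand_compliance": 0.2}

def evaluate(outputs):
    """Score every output on each criterion and attach a weighted aggregate."""
    results = []
    for text in outputs:
        scores = {name: fn(text) for name, fn in CRITERIA.items()}
        aggregate = sum(WEIGHTS[name] * score for name, score in scores.items())
        results.append({"output": text, "scores": scores, "aggregate": aggregate})
    return results

def drill_down(results, criterion, threshold=0.5):
    """Filter to outputs scoring below a threshold on a single criterion,
    the kind of drill-down the recommendations call for."""
    return [r for r in results if r["scores"][criterion] < threshold]

if __name__ == "__main__":
    sample = [
        "Our product ships worldwide within two business days.",
        "Unverified claim: Rival Corp charges double for the same service.",
    ]
    results = evaluate(sample)
    for r in results:
        print(f"{r['aggregate']:.2f}", r["scores"])
    print("Low accuracy:", [r["output"] for r in drill_down(results, "accuracy")])
```

Weighted averaging is only one possible aggregation strategy; exposing the per-criterion scores alongside the aggregate, as above, reflects the paper's emphasis on transparency about how the final number is produced.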
Atsuyoshi Nakamura, Naoki Abe
Electronic Commerce Research
Kun Wang, Juwei Shi, et al.
PACT 2011
Benny Kimelfeld, Yehoshua Sagiv
ICDT 2013