Gang Wang, Fei Wang, et al.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
This paper investigates the challenges and opportunities in evaluating outputs generated by large language models (LLMs) at scale. With LLMs increasingly integrated into applications ranging from customer service to content creation, organizations face significant hurdles in assessing qualitative aspects such as accuracy, coherence, bias, and compliance with brand guidelines. Through a comprehensive literature review and a comparative analysis of existing evaluation tools, spanning both graphical interfaces and code-driven systems, this research identifies critical challenges in scalability, multi-criteria support, aggregation of results, and transparency. Complementing the literature review, contextual inquiries with professionals from diverse technical backgrounds provided insight into user preferences and practical difficulties in evaluating large datasets of LLM outputs. Based on these findings, we propose design recommendations for next-generation LLM evaluation tools, emphasizing advanced filtering and drill-down capabilities, multi-level aggregated insights that combine quantitative and qualitative analyses, iterative refinement of evaluation criteria to adapt to evolving requirements, and interactive visualizations that elucidate the underlying scoring process. These recommendations aim to enhance the reliability and trustworthiness of evaluation systems, ultimately supporting more efficient and nuanced assessment of LLM performance across varied real-world applications.
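To make the recommended multi-criteria workflow concrete, the following is a minimal sketch, assuming simple placeholder heuristics; the criterion functions, weights, and the `evaluate`/`drill_down` helpers are illustrative names introduced here, not components of the tool described in the paper.

```python
# Illustrative multi-criteria evaluation: each criterion maps an LLM output
# string to a score in [0, 1]. The heuristics and weights are placeholders,
# not the scoring methods studied in the paper.
CRITERIA = {
    "accuracy": lambda text: 0.0 if "unverified" in text.lower() else 1.0,
    "coherence": lambda text: min(len(text.split()) / 50.0, 1.0),
    "brand_compliance": lambda text: 0.0 if "rival corp" in text.lower() else 1.0,
}
WEIGHTS = {"accuracy": 0.5, "coherence": 0.3, "brand_compliance": 0.2}

def evaluate(outputs):
    """Score every output on each criterion and attach a weighted aggregate."""
    results = []
    for text in outputs:
        scores = {name: fn(text) for name, fn in CRITERIA.items()}
        aggregate = sum(WEIGHTS[name] * score for name, score in scores.items())
        results.append({"output": text, "scores": scores, "aggregate": aggregate})
    return results

def drill_down(results, criterion, threshold=0.5):
    """Filter to outputs scoring below a threshold on a single criterion,
    the kind of drill-down the recommendations call for."""
    return [r for r in results if r["scores"][criterion] < threshold]

if __name__ == "__main__":
    sample = [
        "Our product ships worldwide within two business days.",
        "Unverified claim: Rival Corp charges double for the same service.",
    ]
    results = evaluate(sample)
    for r in results:
        print(f"{r['aggregate']:.2f}", r["scores"])
    print("Low accuracy:", [r["output"] for r in drill_down(results, "accuracy")])
```

Weighted averaging is only one possible aggregation strategy; exposing the per-criterion scores alongside the aggregate, as above, reflects the paper's emphasis on transparency about how the final number is produced.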
Atsuyoshi Nakamura, Naoki Abe
Electronic Commerce Research
Kun Wang, Juwei Shi, et al.
PACT 2011
Benny Kimelfeld, Yehoshua Sagiv
ICDT 2013