The rapid advancement of Large Language Model (LLM)-based agents has sparked a growing interest in their evaluation, bringing forth both challenges and opportunities. This tutorial provides a comprehensive introduction to evaluating LLM-based agents, catering to participants from diverse backgrounds with little prior knowledge of agents, LLMs, metrics, or benchmarks. We will establish foundational concepts and explore key benchmarks that measure critical agentic capabilities, including planning, tool use, self-reflection, and memory. We will examine evaluation strategies tailored to various agent types, ranging from web-based and software engineering to conversational and scientific applications. We will also cover benchmarks and leaderboards that evaluate generalist agents over diverse skill sets. Additionally, we will review prominent developer frameworks for agent evaluation. Finally, we will present emerging trends in the field, identify current limitations, and propose directions for future research.