Tutorial

Evaluating LLM-based Agents: Foundations, Best Practices and Open Challenges

Abstract

The rapid advancement of Large Language Model (LLM)-based agents has sparked growing interest in their evaluation, bringing with it both challenges and opportunities. This tutorial provides a comprehensive introduction to evaluating LLM-based agents, catering to participants from diverse backgrounds with little prior knowledge of agents, LLMs, metrics, or benchmarks. We will establish foundational concepts and explore key benchmarks that measure critical agentic capabilities, including planning, tool use, self-reflection, and memory. We will examine evaluation strategies tailored to specific agent types, ranging from web and software engineering agents to conversational and scientific applications. We will also cover benchmarks and leaderboards that evaluate generalist agents across diverse skill sets. Additionally, we will review prominent developer frameworks for agent evaluation. Finally, we will present emerging trends in the field, identify current limitations, and propose directions for future research.