Workshop paper
Axiom-Aware FunSearch for Non-Constructive Mathematics
Max Esposito, Besart Shyti
NeurIPS 2025
This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.
NeurIPS 2025
Jungkoo Kang
NeurIPS 2025
Isha Puri, Shivchander Sudalairaj, et al.
NeurIPS 2025
Djallel Bouneffouf, Matthew Riemer, et al.
NeurIPS 2025