Pavel Klavík, A. Cristiano I. Malossi, et al.
Philos. Trans. R. Soc. A
Text-to-SQL systems translate natural language questions into executable SQL queries, enabling intuitive access to structured data. While recent large language models have substantially improved generation quality, evaluating these systems remains a complex challenge: SQL semantics are subtle, multiple valid query formulations exist for the same question, and execution-based metrics are implemented inconsistently across the community. We demonstrate the Text-to-SQL Evaluation Toolkit, an open-source, modular framework for rigorous evaluation of text-to-SQL systems. The toolkit provides a suite of more than twelve metrics, spanning execution accuracy, SQL syntactic equivalence, and LLM-as-judge scoring, together with integrated pipelines for inference, SQL execution against real databases, SQL profiling, and detailed error analysis. A web-based dashboard enables interactive exploration of benchmark results, cross-pipeline comparison, and per-record drill-down with live re-evaluation. The demonstration walks attendees through evaluating and comparing text-to-SQL pipelines on both established public benchmarks and new enterprise benchmarks, diagnosing failure patterns, and using an LLM-as-judge to assess predictions where traditional metrics fall short.
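To make the idea of an execution-based metric concrete, the following is a minimal sketch of one possible execution-match check, not the toolkit's own implementation. It assumes a SQLite database and compares the rows returned by a gold query and a predicted query as unordered multisets; the function name `execution_match`, the `employees` table, and the sample queries are all hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter


def execution_match(conn, gold_sql, pred_sql):
    """Return True if both queries return the same multiset of rows.

    Hypothetical helper: row order is ignored, duplicates are counted.
    A predicted query that fails to execute counts as a mismatch.
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    return Counter(gold_rows) == Counter(pred_rows)


# Illustrative in-memory database (assumed schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "eng", 70000), ("Bo", "ops", 50000), ("Cy", "eng", 40000)],
)

gold = "SELECT name FROM employees WHERE salary > 50000"
# A differently written query with the same result on this data.
equivalent = "SELECT name FROM employees WHERE NOT salary <= 50000"
# A near miss: the boundary row (salary = 50000) is included.
off_by_one = "SELECT name FROM employees WHERE salary >= 50000"
# A query that does not parse.
invalid = "SELEC name FROM employees"

same_result = execution_match(conn, gold, equivalent)
boundary_result = execution_match(conn, gold, off_by_one)
invalid_result = execution_match(conn, gold, invalid)
```

Even this small check embeds design choices (whether row order matters, how duplicates and failed executions are scored), which is one way implementations of the same metric can diverge. It also only compares results on one database instance, so two queries that differ semantically can still match if the data happens not to distinguish them.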
Erik Altman, Jovan Blanusa, et al.
NeurIPS 2023
Conrad Albrecht, Jannik Schneider, et al.
CVPR 2025
Miao Guo, Yong Tao Pei, et al.
WCITS 2011