Text-to-SQL Evaluation Toolkit

Oktie Hassanzadeh; Yotam Perlitz; Nhan Pham; Tanvi Kaple; Karolina Źróbek; Long Vu; Michael Glass; Shankar Subramaniam; Mohammadreza Pourreza; Davood Rafiei

VLDB 2026

Demo paper

31 Aug 2026

Abstract

Text-to-SQL systems translate natural language questions into executable SQL queries, enabling intuitive access to structured data. While recent large language models have substantially improved generation quality, evaluating these systems remains a complex challenge: SQL semantics are subtle, multiple valid query formulations exist for the same question, and execution-based metrics are implemented inconsistently across the community. We demonstrate the Text-to-SQL Evaluation Toolkit, an open-source, modular framework for rigorous evaluation of text-to-SQL systems. The toolkit provides a comprehensive suite of over twelve metrics, spanning execution accuracy, SQL syntactic equivalence, and LLM-as-judge scoring, together with integrated pipelines for inference, SQL execution against real databases, SQL profiling, and detailed error analysis. A web-based dashboard enables interactive exploration of benchmark results, cross-pipeline comparison, and per-record drill-down with live re-evaluation. The demonstration walks attendees through evaluating and comparing text-to-SQL pipelines on both established public benchmarks and new enterprise benchmarks, diagnosing failure patterns, and using LLM-as-judge to assess predictions where traditional metrics fall short.
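To make the idea of execution accuracy concrete, here is a minimal sketch of one common way to compute it: run the gold and predicted queries against the same database and compare the returned rows as multisets. It is an illustration only and does not reflect the toolkit's actual API; the function names `execution_match` and `demo`, the `employees` table, and the choice of SQLite are all assumptions made for this example.

```python
import sqlite3
from collections import Counter


def execution_match(conn, gold_sql, pred_sql):
    """Return True if both queries yield the same multiset of rows.

    Hypothetical helper: row order is ignored, duplicates are counted,
    and a prediction that fails to execute is scored as a mismatch.
    """
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(gold_rows) == Counter(pred_rows)


def demo():
    # Toy in-memory database (illustrative data, not from any benchmark).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (
            id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER
        );
        INSERT INTO employees VALUES
            (1, 'Ana', 'eng', 120),
            (2, 'Bo',  'eng', 100),
            (3, 'Cy',  'ops',  90);
        """
    )
    gold = "SELECT name FROM employees WHERE dept = 'eng' AND salary > 100"
    # A differently written query that returns the same rows on this data.
    alt = (
        "SELECT e.name FROM employees AS e "
        "WHERE e.salary >= 101 AND e.dept IN ('eng')"
    )
    # Drops the salary filter, so it returns an extra row.
    wrong = "SELECT name FROM employees WHERE dept = 'eng'"
    # Syntactically invalid SQL.
    broken = "SELEC name FROM employees"
    return [execution_match(conn, gold, q) for q in (alt, wrong, broken)]


if __name__ == "__main__":
    print(demo())
```

In this sketch the alternative formulation matches the gold query while the other two do not. Matching on a single database instance is only evidence of equivalence, not proof: two queries can agree on one dataset and differ on another, which is one reason execution-based scores are often complemented by other metrics.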
