I have been working in the analytics space for over 20 years. Back then, it was not called “analytics”; it was “Business Intelligence”, or “Decision Support Systems” in even older times. The terms keep changing, from data warehouses to Big Data to lakehouses and now AI, but the essence and the eternal promise of self-service analytics remain the same: extracting truth from data to empower users without relying on someone from the data team. AI without humans in the loop? That sounds controversial.
With the advent of Large Language Models (LLMs), one use case I find fascinating is developing conversational interfaces to chat with databases (Text-to-SQL). The potential here is immense, promising to democratize data access across organizations.
However, for this specific use case, the solution has to be binary. It either works or it doesn’t.
An accuracy of 80% or even 90% is, unfortunately, not enough. Giving your end users an AI analytical application that hallucinates tables or misinterprets filters is no joke. You cannot compromise on accuracy, because it immediately erodes trust. And what happens when a system loses trust? It will not be used. Adoption will decline, not to mention the catastrophic risk of business decisions being made on the wrong data.
The Complexity of the RAG Pipeline
I started my research on this topic over a year and a half ago, and it quickly became clear that orchestrating a robust Text-to-SQL RAG (Retrieval-Augmented Generation) application is not trivial. You need multiple components in your pipeline, working in perfect harmony:
- An intent classifier to detect the goal of the question.
- A vector database to store additional context (like business definitions) that the language models need.
- An embeddings model to vectorize this additional knowledge.
- A retrieval mechanism for the stored data.
- Access to the database.
- The ability to generate SQL in the specific dialect of the database.
- And the ability to evaluate the results.
This last part, evaluation, is often omitted or treated as an afterthought, but I believe it is perhaps the most crucial component for ensuring the reliability needed in an enterprise setting.
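To make these moving parts concrete, here is a minimal sketch of how such a pipeline might be wired together in Python. Every function body below is a hypothetical placeholder returning canned values; in a real implementation each step would be backed by an LLM call, a vector store, and a database driver.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    intent: str
    context: list[str]
    sql: str
    passed_evaluation: bool

# Hypothetical stand-ins for real components (LLM client, vector store, DB driver).
def classify_intent(question: str) -> str:
    # e.g. prompt an LLM to label the question: "metric", "lookup", "out_of_scope", ...
    return "metric"

def retrieve_context(question: str, top_k: int = 5) -> list[str]:
    # e.g. embed the question and search a vector store of business definitions
    return ["Churn rate = customers lost in the period / customers at period start"]

def generate_sql(question: str, context: list[str], dialect: str = "bigquery") -> str:
    # e.g. prompt an LLM with the schema, the retrieved context, and the target dialect
    return "SELECT COUNT(*) FROM `project.dataset.customers`"

def evaluate(predicted_sql: str, gold_sql: str | None = None) -> bool:
    # e.g. execute predicted and gold SQL and compare result sets (see the metrics below)
    return gold_sql is None or predicted_sql == gold_sql

def answer(question: str) -> PipelineResult:
    intent = classify_intent(question)
    context = retrieve_context(question)
    sql = generate_sql(question, context)
    return PipelineResult(intent, context, sql, evaluate(sql))

print(answer("How many customers churned last month?"))
```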
BigQuery: A Case Study in Native AI Integration
Managing this complex pipeline often requires integrating multiple platforms. I was recently impressed by how BigQuery has merged analytics and generative AI natively within its platform.
You can work with your SQL in the BigQuery IDE and use generative AI immediately, without switching to another platform or product. For example, you can query the database and send the retrieved results straight to Gemini (or, through Vertex AI, to other models). You can use Gemini to classify intent, create embeddings and store them using BigQuery’s vector search capabilities, run a semantic search, and generate SQL.
All of that with only one platform, without the hassle of managing multiple subscriptions.
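As an illustration of what “one platform” means in practice, the sketch below embeds a user question and runs a vector search over a glossary table entirely inside BigQuery SQL, driven from the google-cloud-bigquery Python client. ML.GENERATE_EMBEDDING and VECTOR_SEARCH are BigQuery functions, but the project, dataset, model, and column names here are hypothetical, and the exact options may differ from your setup; treat this as a sketch and check the current documentation.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Embed the question and search a (hypothetical) glossary table whose rows were
# embedded with the same remote model, all of it expressed as BigQuery SQL.
sql = """
SELECT base.term, base.definition, distance
FROM VECTOR_SEARCH(
  TABLE `my_dataset.business_glossary`,      -- column `embedding` holds the vectors
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `my_dataset.embedding_model`,    -- remote model pointing at Vertex AI
      (SELECT @question AS content)
    )
  ),
  top_k => 5
)
ORDER BY distance
"""

job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "question", "STRING", "How do we define churn rate?"
            )
        ]
    ),
)
for row in job.result():
    print(row.term, row.definition, row.distance)
```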
Of course, like everything in life, it has its drawbacks.
One of the main cons is that BigQuery might not be the cheapest database, and I have heard stories of startups where a single wrong query drained the credit card. It has not happened to me, but I can see how it could. Another con is that you get entirely locked into Google. Maybe that is not a bad thing, in the same way we are all locked into Gmail. Perhaps in the future, AI will be a commodity, the way email is now.
Another drawback is the lack of granular traceability of token costs, and the absence of something like a “mock LLM” for development; you do not want to be calling the real, expensive LLM while you are still building.
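On that second point, one workaround (a generic sketch, not a BigQuery feature) is to hide the model behind a small interface and swap in a canned stub while developing, so the expensive model is only called once you move past the development stage.

```python
from typing import Protocol

class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class GeminiModel:
    """Production client; in reality this would call the Vertex AI / Gemini SDK."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("wire up the real SDK call here")

class MockModel:
    """Development stand-in: free, instant, and deterministic."""
    def __init__(self, canned: dict[str, str]):
        self.canned = canned
    def generate(self, prompt: str) -> str:
        return self.canned.get(prompt, "SELECT 1  -- placeholder SQL")

def generate_sql(model: TextModel, question: str) -> str:
    return model.generate(f"Write BigQuery SQL for: {question}")

# During development, the mock answers instantly and costs nothing:
dev_model = MockModel(
    {"Write BigQuery SQL for: total users": "SELECT COUNT(*) FROM `my_dataset.users`"}
)
print(generate_sql(dev_model, "total users"))
```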
If you are okay with the cons above, you get a magnificent product that combines multiple tools into a single cloud platform capable of handling data at massive scale.
I created the following repo as part of a Kaggle hackathon, where I explored these native BigQuery capabilities further. For more info, please visit the repo here:
https://github.com/garyzava/bigq-ethereum-rag
The Missing Piece: Rigorous Evaluation

Now, back to evaluation frameworks. Platforms like BigQuery simplify the architecture, but they do not automatically solve the accuracy problem. I see multiple solutions out there, but most of them lack robust evaluation capabilities.
If we accept that Text-to-SQL must be binary (correct or incorrect), we need evaluation strategies that reflect the messy reality of enterprise data, not the pristine environments of academic or demo datasets.
Evaluating a Text-to-SQL system is notoriously difficult, due both to the declarative nature of SQL and to the complexity of your database schema. Does it have thousands of tables? Are those tables well documented? Probably not. Are naming conventions consistent across all tables? Two queries can look completely different syntactically (e.g., different join orders, aliasing, or use of CTEs) yet produce identical results.
To truly benchmark your RAG application during development and in production, you must use the right metrics.
Metrics That Matter
Going back to the promise of self-service BI or analytics, this means the end-user is relying 100% on themselves; unfortunately, there is no human-in-the-loop or data expert to validate the results. Because of this, we need to establish an explainable AI or evaluation framework with a set of metrics to measure the quality of the generated SQL.
- The Shift to Execution Accuracy (EX): Early benchmarks relied on Exact Match (EM), which compared the predicted SQL string to the ground truth. This was deeply flawed, as it penalized valid syntactic variations. The modern standard is Execution Accuracy (EX). This metric executes both the predicted SQL and the “Gold” (ground truth) SQL against the actual database and compares the returned result sets. This correctly validates queries regardless of how they are written.
- Focused Evaluation: In enterprise contexts, a query might return extra, non-essential columns (e.g., an ID column used for a join). Strict execution accuracy might mark this as a failure. “Execution-based focused evaluation” allows for a more nuanced comparison, checking if the target columns and values are correct, while being more lenient on extraneous data or row ordering.
- The “Soft-F1” Metric: To mitigate the binary nature of Execution Accuracy (where one wrong cell fails the entire test), Soft-F1 is increasingly used. This metric provides partial credit by calculating the overlap between the predicted and gold results. If a query returns 99 out of 100 correct rows, Soft-F1 reflects high performance, whereas EX would return 0. This is crucial for debugging; a minimal sketch of both EX and Soft-F1 follows this list.
- LLM-as-a-Judge: Sometimes execution is impossible (e.g., missing private data, environment errors). In these cases, an advanced LLM can be prompted to compare the semantic logic of the predicted SQL against the Gold SQL. While less objective than execution, it correlates highly with human judgment.
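As an illustration of Execution Accuracy and Soft-F1, here is a minimal, simplified sketch. It assumes both the predicted and the gold query have already been executed, and it compares the result sets as unordered multisets of rows; real harnesses such as BIRD and Spider 2.0 handle ordering requirements, floats, and NULLs more carefully, and their official Soft-F1 definitions may differ in detail.

```python
from collections import Counter

Row = tuple  # one result row, e.g. ("DE", 42)

def execution_accuracy(pred_rows: list[Row], gold_rows: list[Row]) -> int:
    """Binary EX: 1 only if the two result sets match exactly (order-insensitive)."""
    return int(Counter(pred_rows) == Counter(gold_rows))

def soft_f1(pred_rows: list[Row], gold_rows: list[Row]) -> float:
    """Partial credit: F1 over the row-level overlap of predicted and gold results."""
    overlap = sum((Counter(pred_rows) & Counter(gold_rows)).values())
    if not pred_rows or not gold_rows or overlap == 0:
        return 0.0
    precision = overlap / len(pred_rows)
    recall = overlap / len(gold_rows)
    return 2 * precision * recall / (precision + recall)

# 99 of 100 gold rows returned correctly, one row wrong:
gold = [(i,) for i in range(100)]
pred = [(i,) for i in range(99)] + [(999,)]
print(execution_accuracy(pred, gold))  # 0: one wrong row fails EX outright
print(round(soft_f1(pred, gold), 2))   # 0.99: Soft-F1 still reflects the near miss
```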
Spider 2.0: The Enterprise Reality Check
Currently, there are three remarkable evaluation frameworks: Spider 2.0, BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL), and SynSQL (based on synthetic data). However, the industry has been suffering from a false sense of security created by outdated benchmarks. For years, it relied on Spider 1.0, which focused on small, clean SQLite databases (averaging fewer than 10 tables). Models were achieving 90%+ accuracy, leading many to believe the problem was “solved.”
The framework that I always emphasize, which incorporates these modern metrics and truly tests enterprise readiness, is Spider 2.0.
Spider 2.0 (released in conjunction with ICLR 2025) is a paradigm shift, designed to address this “reality gap” by introducing the complexities that break LLMs in production:
- Massive Scale: Enterprise schemas are huge. Spider 2.0 databases average 812 columns, with some exceeding 3,000. This scale often exceeds the LLM’s context limits, forcing models to employ “Schema Linking” (retrieval) strategies just to identify the relevant tables before generating SQL (a sketch of this step appears below).
- Dialect Diversity: Real companies use Snowflake, BigQuery, and T-SQL, not just SQLite. Spider 2.0 enforces dialect diversity, requiring models to master specific syntax (e.g., handling nested JSON data using UNNEST or FLATTEN).
- External Knowledge: Business logic (like the definition of “Churn Rate”) resides in documentation or project codebases (like DBT), not the schema. Spider 2.0 simulates this by providing external files (Markdown, YAML) the model must read to ground its reasoning.
- The Agentic Workflow: Crucially, Spider 2.0 models the workflow of a modern data engineer. It moves beyond static translation, evaluating the model’s ability to explore the file system, read documentation, interact with live database instances, and debug errors iteratively.
The difference in difficulty is stark. Models that dominate Spider 1.0 see their success rates drop to 10-20% on the full Spider 2.0 benchmark, highlighting the deficiencies of current LLMs when faced with real-world complexity.
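To make the Schema Linking point concrete, here is a minimal sketch: keep a one-line description per table, embed the question and the descriptions, and hand only the top-k matching tables to the model. The embed() function below is a deliberately crude placeholder (a character-frequency vector) so the example runs standalone; in practice you would use a real embedding model.

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder embedder: a normalized bag-of-characters vector, only so the
    # sketch runs end to end. Swap in a real embedding model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already normalized

def link_schema(question: str, table_docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the top_k table names whose descriptions best match the question."""
    q = embed(question)
    ranked = sorted(table_docs, key=lambda t: cosine(q, embed(table_docs[t])), reverse=True)
    return ranked[:top_k]

# Hypothetical one-line docs for a handful of tables out of thousands:
tables = {
    "fct_orders": "one row per customer order with amount and order date",
    "dim_customers": "customer attributes: country, signup date, churn flag",
    "fct_web_events": "raw clickstream events from the website",
}
print(link_schema("How many customers churned last quarter?", tables, top_k=2))
```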
Conclusion: The Binary Bar for Enterprise Data
The journey from Business Intelligence to AI-driven analytics has been marked by increasing abstraction, but the fundamental requirement for data integrity remains unchanged. While the promise of Text-to-SQL is closer than ever, we must resist the allure of high scores on outdated benchmarks.
Achieving 90% accuracy might be academically interesting, but in the enterprise, it is industrially useless. The bar is binary: it works or it breaks trust.
As platforms like BigQuery simplify the integration of AI and data, it is imperative that we simultaneously adopt sophisticated evaluation methodologies and rigorous benchmarks like Spider 2.0. Only by testing against the messy reality of enterprise data can we develop Text-to-SQL applications reliable enough to bet the business on.
Until next time, I hope you have found this topic as fascinating as I do.
Further Reading
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. Fangyu Lei, Jixuan Chen, Yuxiao Ye, et al. arXiv, November 2024; accepted to ICLR 2025 (Oral). https://arxiv.org/abs/2411.07763
Certainly, like anything with AI nowadays, this conversation does not have to end here. I would love to hear your input and perspective at www.gyza.org