Why 90% Accuracy in Text-to-SQL is 100% Useless

AiNEWS2025 by AiNEWS2025
2026-01-13
in Machine Learning


I have been working in the analytics space for over 20 years. Back then it was not called “analytics”; it was “Business Intelligence”, or even “Decision Support Systems” before that. The terms change, from data warehouses to Big Data to lakehouses and now AI, but the essence and the eternal promise of self-service analytics remain the same: extracting truth from data to empower users without relying on someone from the data team. AI without humans in the loop? That sounds controversial.

With the advent of Large Language Models (LLMs), one use case I find fascinating is developing conversational interfaces to chat with databases (Text-to-SQL). The potential here is immense, promising to democratize data access across organizations.

However, for this specific use case, the solution has to be binary. It either works or it doesn’t.

An accuracy of 80% or even 90% is, unfortunately, not enough. Giving your end-users an AI analytical application that hallucinates tables or misinterprets filters is no joke. You cannot compromise on accuracy because any error immediately erodes trust. And what happens when a system loses trust? It will not be used. Adoption will decline, and that is before counting the catastrophic risk of business decisions being made on the wrong data.

The Complexity of the RAG Pipeline

I started my research on this topic over a year and a half ago, and it quickly became clear that orchestrating a robust Text-to-SQL RAG (Retrieval-Augmented Generation) application is not trivial. You need multiple components in your pipeline, working in perfect harmony:

  • An intent classifier to detect the goal of the question.
  • A vector database to store additional context (like business definitions) that the language models need.
  • An embeddings model to vectorize this additional knowledge.
  • A retrieval mechanism for the stored data.
  • Access to the database.
  • The ability to generate SQL in the specific dialect of the database.
  • And the ability to evaluate the results.

Evaluation, the last item on that list, is in my experience often omitted or treated as an afterthought, but it is perhaps the most crucial component for ensuring the reliability needed in an enterprise setting.
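To make the retrieval part of the pipeline concrete, here is a minimal sketch of looking up business definitions and placing them in the prompt. Everything in it is an assumption for illustration: `GLOSSARY`, `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, and the bag-of-words vector is only a stand-in for a real embeddings model and vector database.

```python
import math
import re
from collections import Counter

# Hypothetical business glossary; a real pipeline would keep this in a vector store.
GLOSSARY = {
    "churn rate": "Customers who cancelled in a period divided by customers active at its start.",
    "active user": "A user with at least one login in the trailing 30 days.",
    "net revenue": "Gross revenue minus refunds and discounts.",
}


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embeddings model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k glossary terms most similar to the question."""
    q = embed(question)
    return sorted(
        GLOSSARY,
        key=lambda term: cosine(q, embed(term + " " + GLOSSARY[term])),
        reverse=True,
    )[:k]


def build_prompt(question: str, dialect: str = "BigQuery") -> str:
    """Assemble the context an LLM would receive before generating SQL."""
    definitions = "\n".join(f"- {t}: {GLOSSARY[t]}" for t in retrieve(question))
    return (
        f"Write {dialect} SQL.\n"
        f"Business definitions:\n{definitions}\n"
        f"Question: {question}"
    )


terms = retrieve("What was the churn rate last quarter?")
prompt = build_prompt("What was the churn rate last quarter?")
```

In practice the ranking quality depends entirely on the embeddings model, which is why the retrieval step deserves its own evaluation, separate from the generated SQL.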

BigQuery: A Case Study in Native AI Integration

Managing this complex pipeline often requires integrating multiple platforms. I was recently impressed by how BigQuery has merged analytics and generative AI natively in its platform.

You can work with your SQL in the BigQuery IDE and use Gen AI immediately, without going to another platform or product. For example, you can query the database and send the retrieved results straight to Gemini (or, through Vertex, to other models). You can use Gemini to classify intent, create embeddings and store them using BigQuery’s vector database capabilities, run a semantic search, and generate SQL.

All of that with only one platform, without the hassle of managing multiple subscriptions.

Of course, like everything in life, it has its drawbacks.

One of the main cons is that BigQuery might not be the cheapest database, and I have heard stories from startups where a single wrong query drained a credit card. It has not happened to me, but I can see how it could. Another con is that you get entirely locked in with Google. Maybe that is not a bad thing; we are all locked in with Gmail in the same way. Perhaps in the future, AI will be a commodity, the way email is now.

Another drawback is the lack of granular traceability of token costs, along with the absence of a sort of “mock LLM” for development; you do not want to pay for the real, expensive LLM during the development stage.
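One way to approximate a “mock LLM” on the application side is to hide the model behind a small interface and swap in a fake during development. The sketch below is an assumption for illustration, not a BigQuery feature: `SQLGenerator`, `MockLLM`, and `answer` are hypothetical names, and the character count is only a rough proxy for token usage.

```python
from typing import Protocol


class SQLGenerator(Protocol):
    """Anything that can turn a question into SQL (real model or mock)."""

    def generate_sql(self, question: str) -> str: ...


class MockLLM:
    """Returns canned SQL by keyword and records usage; no tokens are spent."""

    def __init__(self, canned: dict[str, str], default: str = "SELECT 1"):
        self.canned = canned
        self.default = default
        self.calls = 0          # number of requests that would have been billed
        self.prompt_chars = 0   # crude stand-in for prompt token counts

    def generate_sql(self, question: str) -> str:
        self.calls += 1
        self.prompt_chars += len(question)
        for keyword, sql in self.canned.items():
            if keyword in question.lower():
                return sql
        return self.default


def answer(question: str, llm: SQLGenerator) -> str:
    """Application code depends only on the interface, not on a vendor client."""
    return llm.generate_sql(question)


mock = MockLLM({"revenue": "SELECT SUM(amount) FROM orders"})
sql = answer("What is the total revenue?", mock)
fallback = answer("How many users signed up?", mock)
```

Because `answer` only sees the interface, the same code path can later be pointed at the real model, while the counters give a first, coarse view of how many calls a feature makes.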

If you are okay with the cons above, you get a magnificent product that combines multiple tools into one single cloud platform that can handle big data at massive scale.

I have created the following repo as part of a Kaggle hackathon, where I explored these BigQuery-native capabilities further. For more information, please visit the repo:

https://github.com/garyzava/bigq-ethereum-rag

The Missing Piece: Rigorous Evaluation

A whimsical visualization of the multi-layered evaluation framework required for robust Text-to-SQL systems, assessing predicted answers against gold standards using metrics ranging from Execution Accuracy to LLM-based judging. Image by author using Google’s Gemini.

Now, going back to eval frameworks. Platforms like BigQuery simplify the architecture, but they do not automatically solve the accuracy problem. I see multiple solutions out there, but most of them lack robust evaluation capabilities.

If we accept that Text-to-SQL must be binary (correct or incorrect), we need evaluation strategies that reflect the messy reality of enterprise data, not the pristine environments of academic or demo datasets.

Evaluating a Text-to-SQL system is notoriously difficult, both because SQL is declarative and because enterprise schemas are complex. Does your database have thousands of tables? Are those tables well documented? Probably not. Are naming conventions consistent across all tables? On top of that, two queries can look completely different syntactically (e.g., different join orders, aliasing, or use of CTEs) yet produce identical results.
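The point about syntactically different but equivalent queries is easy to reproduce. The following sketch uses an in-memory SQLite database with a made-up two-table schema (`customers`, `orders`); the table names and data are assumptions chosen only for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Version A: explicit join with aliases.
join_query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
"""

# Version B: aggregate first in a CTE, then join implicitly.
cte_query = """
    WITH totals AS (
        SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
    )
    SELECT name, total FROM totals, customers WHERE customers.id = totals.customer_id
"""

# Sort the rows so the comparison does not depend on row order.
rows_a = sorted(conn.execute(join_query).fetchall())
rows_b = sorted(conn.execute(cte_query).fetchall())

same_text = join_query.split() == cte_query.split()  # string comparison
same_rows = rows_a == rows_b                         # result comparison
```

Here `same_text` is false while `same_rows` is true: a string comparison would reject a query that returns exactly the right answer.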

To truly benchmark your RAG application during development and in production, you must use the right metrics.

Metrics That Matter

Going back to the promise of self-service BI or analytics, this means the end-user is relying 100% on themselves; unfortunately, there is no human-in-the-loop or data expert to validate the results. Because of this, we need to establish an explainable AI or evaluation framework with a set of metrics to measure the quality of the generated SQL.

  1. The Shift to Execution Accuracy (EX): Early benchmarks relied on Exact Match (EM), which compared the predicted SQL string to the ground truth. This was deeply flawed, as it penalized valid syntactic variations. The modern standard is Execution Accuracy (EX). This metric executes both the predicted SQL and the “Gold” (ground truth) SQL against the actual database and compares the returned result sets. This correctly validates queries regardless of how they are written.
  2. Focused Evaluation: In enterprise contexts, a query might return extra, non-essential columns (e.g., an ID column used for a join). Strict execution accuracy might mark this as a failure. “Execution-based focused evaluation” allows for a more nuanced comparison, checking if the target columns and values are correct, while being more lenient on extraneous data or row ordering.
  3. The “Soft-F1” Metric: To mitigate the binary nature of Execution Accuracy (where one wrong cell fails the entire test), Soft-F1 is increasingly used. This metric provides partial credit by calculating the overlap between the predicted and gold results. If a query returns 99 out of 100 correct rows, Soft-F1 reflects high performance, whereas EX would return 0. This is crucial for debugging.
  4. LLM-as-a-Judge: Sometimes execution is impossible (e.g., missing private data, environment errors). In these cases, an advanced LLM can be prompted to compare the semantic logic of the predicted SQL against the Gold SQL. While less objective than execution, it correlates highly with human judgment.
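The first three metrics can be sketched over result sets that have already been fetched. This is a simplified illustration under stated assumptions: rows are tuples of hashable values, the focused check matches columns by name, and `soft_f1` here is a row-level variant rather than the official definition used by any particular benchmark. The function names are hypothetical.

```python
from collections import Counter


def execution_accuracy(pred_rows, gold_rows) -> bool:
    """EX: both result sets contain the same rows (compared as multisets, order ignored)."""
    return Counter(pred_rows) == Counter(gold_rows)


def focused_match(pred_cols, pred_rows, gold_cols, gold_rows) -> bool:
    """Lenient check: keep only the gold columns from the prediction, ignore extras and ordering."""
    if not set(gold_cols) <= set(pred_cols):
        return False
    idx = [pred_cols.index(c) for c in gold_cols]
    projected = [tuple(row[i] for i in idx) for row in pred_rows]
    return Counter(projected) == Counter(gold_rows)


def soft_f1(pred_rows, gold_rows) -> float:
    """Partial credit: F1 over the multiset overlap of predicted and gold rows."""
    pred, gold = Counter(pred_rows), Counter(gold_rows)
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# 99 of 100 gold rows returned: EX fails, Soft-F1 stays high (about 0.995).
gold = [(i,) for i in range(100)]
pred = gold[:99]
ex = execution_accuracy(pred, gold)
f1 = soft_f1(pred, gold)

# An extra ID column fails strict EX but passes the focused check.
wide_cols, wide_rows = ["id", "name", "total"], [(2, "Grace", 7.5), (1, "Ada", 15.0)]
gold_cols, gold_rows = ["name", "total"], [("Ada", 15.0), ("Grace", 7.5)]
strict = execution_accuracy(wide_rows, gold_rows)
focused = focused_match(wide_cols, wide_rows, gold_cols, gold_rows)
```

Reporting EX alongside a softer score is useful during development: EX tells you whether the answer can be trusted, while the softer score tells you how far off a failing query was.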

Spider 2.0: The Enterprise Reality Check

Currently there are three remarkable evaluation frameworks: Spider 2.0, BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL) and SynSQL (based on synthetic data). However, the industry has long suffered from a false sense of security created by outdated benchmarks. For years, it relied on Spider 1.0, which focused on small, clean SQLite databases (averaging fewer than 10 tables). Models were achieving 90%+ accuracy, leading many to believe the problem was “solved.”

The framework that I always emphasize, which incorporates these modern metrics and truly tests enterprise readiness, is Spider 2.0.

Spider 2.0 (released in conjunction with ICLR 2025) is a paradigm shift, designed to address this “reality gap” by introducing the complexities that break LLMs in production:

  1. Massive Scale: Enterprise schemas are huge. Spider 2.0 databases average 812 columns, with some exceeding 3,000. This scale often exceeds the LLM’s context limits, forcing models to employ “Schema Linking” (retrieval) strategies just to identify the relevant tables before generating SQL.
  2. Dialect Diversity: Real companies use Snowflake, BigQuery, and T-SQL, not just SQLite. Spider 2.0 enforces dialect diversity, requiring models to master specific syntax (e.g., handling nested JSON data using UNNEST or FLATTEN).
  3. External Knowledge: Business logic (like the definition of “Churn Rate”) resides in documentation or project codebases (like dbt), not the schema. Spider 2.0 simulates this by providing external files (Markdown, YAML) the model must read to ground its reasoning.
  4. The Agentic Workflow: Crucially, Spider 2.0 models the workflow of a modern data engineer. It moves beyond static translation, evaluating the model’s ability to explore the file system, read documentation, interact with live database instances, and debug errors iteratively.
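The “debug errors iteratively” part of the agentic workflow can be illustrated with a small execute-and-repair loop. This is a sketch under assumptions, not how Spider 2.0 or any specific agent is implemented: `run_with_repair` is a hypothetical helper, `scripted_generator` stands in for a real model, and SQLite stands in for the live database.

```python
import sqlite3
from typing import Callable, Optional


def run_with_repair(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
):
    """Ask `generate` for SQL, execute it, and feed any database error back for another try."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        sql = generate(question, error)
        try:
            return conn.execute(sql).fetchall(), attempt
        except sqlite3.Error as exc:
            error = str(exc)  # the error text becomes context for the next attempt
    raise RuntimeError(f"no executable SQL after {max_attempts} attempts: {error}")


def scripted_generator(question: str, error: Optional[str]) -> str:
    """Stand-in for a model: the first guess uses a wrong column, the retry fixes it."""
    if error is None:
        return "SELECT SUM(total) FROM orders"
    return "SELECT SUM(amount) FROM orders"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 5.0);
""")

rows, attempts = run_with_repair(conn, "What is the total order amount?", scripted_generator)
```

Note that a query that executes is not necessarily a query that is correct, so a loop like this complements the evaluation metrics rather than replacing them.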

The difference in difficulty is stark. Models that dominate Spider 1.0 see their success rates drop to 10-20% on the full Spider 2.0 benchmark, highlighting the deficiencies of current LLMs when faced with real-world complexity.

Conclusion: The Binary Bar for Enterprise Data

The journey from Business Intelligence to AI-driven analytics has been marked by increasing abstraction, but the fundamental requirement for data integrity remains unchanged. While the promise of Text-to-SQL is closer than ever, we must resist the allure of high scores on outdated benchmarks.

Achieving 90% accuracy might be academically interesting, but in the enterprise, it is industrially useless. The bar is binary: it works or it breaks trust.

As platforms like BigQuery simplify the integration of AI and data, it is imperative that we simultaneously adopt sophisticated evaluation methodologies and rigorous benchmarks like Spider 2.0. Only by testing against the messy reality of enterprise data can we develop Text-to-SQL applications reliable enough to bet the business on.

Until next time, I hope you have found this topic as fascinating as I do.

Further Reading

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. Authors: Fangyu Lei, Jixuan Chen, Yuxiao Ye, et al. Published: arXiv (Nov 2024); accepted to ICLR 2025 (Oral). Link: https://arxiv.org/abs/2411.07763

Certainly, like anything with AI nowadays, this conversation does not have to end here. I would love to hear your input and perspective at www.gyza.org
