How to Perform Comprehensive Large Scale LLM Validation

AiNEWS2025 by AiNEWS2025
2025-08-22
in Machine Learning

Validation and evaluations are critical to ensuring robust, high-performing LLM applications. However, these topics are often overlooked in the broader discussion of LLMs.

Imagine this scenario: You have an LLM query that replies correctly 999/1000 times when prompted. However, you have to backfill 1.5 million items to populate the database. In this (very realistic) scenario, you’ll experience around 1,500 errors for this LLM prompt alone. Now scale this up to tens, if not hundreds, of different prompts, and you’ve got a real scalability issue at hand.
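The arithmetic behind this scenario can be sketched in a few lines. The function names are my own, and the sketch assumes each call fails independently with the same probability, which is a simplification.

```python
def expected_failures(n_items: int, success_rate: float) -> float:
    """Expected number of failed calls, assuming independent failures."""
    return n_items * (1.0 - success_rate)


def prob_at_least_one_failure(n_items: int, success_rate: float) -> float:
    """Probability that at least one of n_items calls fails."""
    return 1.0 - success_rate ** n_items


# 999/1000 success rate on 1.5 million items, as in the scenario above
print(round(expected_failures(1_500_000, 0.999)))
# Even a batch of only 1,000 calls is more likely than not to contain a failure
print(prob_at_least_one_failure(1_000, 0.999) > 0.5)
```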

The solution is to validate your LLM output and ensure high performance using evaluations. I’ll discuss both topics in this article.

This infographic highlights the main contents of this article. I’ll be discussing validation and evaluation of LLM outputs, qualitative vs quantitative scoring, and dealing with large-scale LLM applications. Image by ChatGPT.

What is LLM validation and evaluation?

I think it’s essential to start by defining what LLM validation and evaluation are, and why they’re important for your application.

LLM validation is about validating the quality of your outputs. One common example of this is running some piece of code that checks if the LLM response answered the user’s question. Validation is important because it ensures you’re providing high-quality responses, and your LLM is performing as expected. Validation can be seen as something you do real time, on individual responses. For example, before returning the response to the user, you verify that the response is actually of high quality.

LLM evaluation is similar; however, it usually does not occur in real time. Evaluating your LLM output could, for example, involve looking at all the user queries from the last 30 days and quantitatively assessing how well your LLM performed.

Validating and evaluating your LLM’s performance is important because you will experience issues with the LLM output. These could, for example, be caused by:

  • Issues with input data (missing data)
  • An edge case your prompt is not equipped to handle
  • Data that is out of distribution

Thus, you need a robust solution for handling LLM output issues. You need to ensure you avoid them as often as possible and handle them in the remaining cases.

Murphy’s law adapted to this scenario:

On a large scale, everything that can go wrong, will go wrong

Qualitative vs quantitative assessments

Before moving on to the individual sections on validation and evaluation, I also want to comment on qualitative vs quantitative assessments of LLMs. When working with LLMs, it’s often tempting to manually evaluate the LLM’s performance for different prompts. However, such manual (qualitative) assessments are highly subject to bias. For example, you might focus most of your attention on the cases in which the LLM succeeded, and thus overestimate its performance. Keeping these potential biases in mind is important, so they don’t distort your view of how well the model performs and where it needs to improve.

Large-scale LLM output validation

After running millions of LLM calls, I’ve seen a lot of different outputs, such as GPT-4o returning … or Qwen2.5 responding with unexpected Chinese characters in

These errors are incredibly difficult to detect with manual inspection because they usually happen in less than 1 out of 1000 API calls to the LLM. However, you need a mechanism to catch these issues when they occur in real time, on a large scale. Thus, I’ll discuss some approaches to handling these issues.

Simple if-else statement

The simplest solution for validation is a piece of code with an if statement that checks the LLM output. For example, if you want to generate summaries for documents, you might want to ensure the LLM output is above some minimum length:

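# LLM summary validation

# first generate summary through an LLM client such as OpenAI, Anthropic, Mistral, etc.
# (llm_client and document are placeholders for your own client and input)
summary = llm_client.chat(f"Make a summary of this document {document}")

# minimum length in characters; the original threshold was lost, so 20 is an assumed placeholder
MIN_SUMMARY_LENGTH = 20

# validate the summary
def validate_summary(summary: str) -> bool:
    # reject summaries shorter than the minimum length
    if len(summary) < MIN_SUMMARY_LENGTH:
        return False
    return True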

Then you can run the validation.

  • If the validation passes, you can continue as usual
  • If it fails, you can choose to ignore the request or utilize a retry mechanism
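A retry mechanism like the one mentioned above could look roughly like this. It is a minimal sketch: generate_with_validation and its parameters are hypothetical names, and the generator is passed in as a callable so the example runs without a real LLM client.

```python
from typing import Callable, Optional


def generate_with_validation(
    generate: Callable[[], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call generate() until validate() accepts the output or attempts run out.

    Returns the first valid output, or None so the caller can ignore the request.
    """
    for _ in range(max_attempts):
        output = generate()
        if validate(output):
            return output
    return None


# Usage with stand-ins: a fake generator that returns an empty summary first.
_outputs = iter(["", "A summary that is long enough to pass the length check."])
result = generate_with_validation(
    generate=lambda: next(_outputs),
    validate=lambda s: len(s) >= 20,
)
print(result)
```

Capping the number of attempts matters at scale, since an input that always fails validation would otherwise retry forever and keep costing tokens.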

You can, of course, make the validate_summary function more elaborate, for example by:

  • Utilizing regex for complex string matching
  • Using a library such as tiktoken to count the number of tokens in the request
  • Ensuring specific words are present/not present in the response
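Here is one way such a more elaborate validator might look. The specific rules, patterns, and thresholds are assumptions chosen for illustration, and a plain word count stands in for token counting so the sketch has no dependencies outside the standard library.

```python
import re

# Assumed rules for illustration only; adapt them to your own output format.
REFUSAL_PATTERN = re.compile(r"\b(as an ai|i cannot|i'm sorry)\b", re.IGNORECASE)
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs


def validate_summary_strict(
    summary: str,
    min_chars: int = 20,
    max_words: int = 200,
    required_words: tuple[str, ...] = (),
) -> list[str]:
    """Return a list of rule violations; an empty list means the summary passed."""
    problems = []
    text = summary.strip()
    if len(text) < min_chars:
        problems.append("too_short")
    if len(text.split()) > max_words:
        problems.append("too_long")
    if REFUSAL_PATTERN.search(text):
        problems.append("refusal_phrase")
    if CJK_PATTERN.search(text):
        problems.append("unexpected_cjk_characters")
    lowered = text.lower()
    for word in required_words:
        if word.lower() not in lowered:
            problems.append(f"missing_word:{word}")
    return problems
```

Returning the list of violated rules instead of a single boolean makes it easier to log which checks fail most often, which is useful input for the evaluations discussed later.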

LLM as a validator

This diagram highlights the flow of an LLM application utilizing an LLM as a validator. You first input the prompt, which here is to create a summary of a document. The LLM creates a summary of a document and sends it to an LLM validator. If the summary is valid, we return the request. However, if the summary is invalid, we can either ignore the request or retry it. Image by the author.

A more advanced and costly validator is using an LLM. In these cases, you utilize another LLM to assess if the output is valid. This works because validating correctness is usually a more straightforward task than generating a correct response. Using an LLM validator is essentially utilizing LLM as a judge, a topic I have written another Towards Data Science article about here.

I often utilize smaller LLMs to perform this validation task because they have faster response times, cost less, and still work well, considering that the task of validating is simpler than generating a correct response. For example, if I utilize GPT-4.1 to generate a summary, I would consider GPT-4.1-mini or GPT-4.1-nano to assess the validity of the generated summary.

Again, if the validation succeeds, you continue your application flow, and if it fails, you can ignore the request or choose to retry it.

In the case of validating the summary, I would prompt the validating LLM to look for summaries that:

  • Are too short
  • Don’t adhere to the expected answer format (for example, Markdown)
  • And other rules you may have for the generated summaries
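A binary LLM validator along these lines could be sketched as follows. The prompt wording is my own, and call_llm is a hypothetical wrapper around whichever client and (smaller) model you use; it only needs to take a prompt string and return the reply text.

```python
from typing import Callable

VALIDATOR_PROMPT = """You are validating a document summary.
Reply with exactly one word: VALID or INVALID.
Mark the summary INVALID if it is too short, is not formatted as Markdown,
or breaks any other rule listed here.

Summary:
{summary}
"""


def llm_validate(summary: str, call_llm: Callable[[str], str]) -> bool:
    """Ask a validator model for a binary verdict on the summary.

    call_llm is a hypothetical wrapper that takes a prompt and returns the
    model's text reply. Anything other than a clear VALID counts as a failure.
    """
    reply = call_llm(VALIDATOR_PROMPT.format(summary=summary))
    return reply.strip().upper() == "VALID"
```

Treating every unclear reply as invalid is a deliberately conservative choice: a malformed verdict from the validator then leads to a retry rather than to a bad summary reaching the user.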

Quantitative LLM evaluations

It is also important to perform large-scale evaluations of LLM outputs. I recommend running them either continually or at regular intervals. Quantitative LLM evaluations are also more effective when combined with qualitative assessments of data samples. For example, suppose the evaluation metrics highlight that your generated summaries are longer than what users prefer. In that case, you should manually look into those generated summaries and the documents they are based on. This helps you understand the underlying problem, which in turn makes it easier to solve.
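To make the summary-length example concrete, a small report like the following could compute the metric and pick out the samples worth reading manually. The function name and the choice of word count as the metric are assumptions for illustration; it expects a non-empty list of summaries.

```python
import statistics


def length_report(summaries: list[str], top_k: int = 3) -> dict:
    """Summarise word counts and pick the longest outputs for manual review."""
    word_counts = [len(s.split()) for s in summaries]
    # Indices of the summaries, longest first
    ranked = sorted(range(len(summaries)), key=lambda i: word_counts[i], reverse=True)
    return {
        "count": len(summaries),
        "mean_words": statistics.mean(word_counts),
        "median_words": statistics.median(word_counts),
        "review_indices": ranked[:top_k],
    }
```

The review_indices entry is the bridge between the quantitative and qualitative sides: the numbers tell you that something is off, and the flagged samples tell you where to look.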

LLM as a judge

As with validation, you can utilize LLM as a judge for evaluation. The difference is that while validation uses LLM as a judge for binary predictions (either the output is valid, or it’s not), evaluation uses it for more detailed feedback. You can, for example, have the LLM judge score the quality of a summary from 1 to 10, making it easier to distinguish medium-quality summaries (around 4-6) from high-quality summaries (7+).

Again, you have to consider costs when using LLM as a judge. Even though you may be utilizing smaller models, you are essentially doubling the number of LLM calls when using LLM as a judge. You can thus consider the following changes to save on costs:

  • Sampling data points, so you only run LLM as a judge on a subset of data points
  • Grouping several data points into one LLM as a judge prompt, to save on input and output tokens
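Both cost-saving ideas can be sketched with the standard library. The helper names, the sampling fraction, and the batch size are hypothetical; a fixed seed is used so the same subset is judged on every run.

```python
import random
from typing import Iterator, Sequence


def sample_for_judging(items: Sequence[str], fraction: float, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of outputs to send to the judge."""
    k = max(1, round(len(items) * fraction)) if items else 0
    return random.Random(seed).sample(list(items), k)


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of items to put into a single judge prompt."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Judge 10% of the outputs, four summaries per judge prompt
summaries = [f"summary {i}" for i in range(100)]
subset = sample_for_judging(summaries, fraction=0.1)
for group in batched(subset, batch_size=4):
    print(len(group))
```

Larger batches save more tokens but make it harder for the judge to keep the items apart, so the batch size is worth tuning against a few manually checked examples.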

I recommend detailing the judging criteria to the LLM judge. For example, you should state what constitutes a score of 1, a score of 5, and a score of 10. Using examples is often a great way of instructing LLMs, as discussed in my article on utilizing LLM as a judge. I often think about how helpful examples are for me when someone is explaining a topic, and you can thus imagine how helpful it is for an LLM.
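A judge prompt with explicit anchor scores, plus a parser for the reply, might look like this. The rubric texts are placeholders I made up for illustration; you should replace them with criteria and examples that match your own summaries.

```python
import re
from typing import Optional

# Illustrative anchor descriptions only; replace with your own criteria.
RUBRIC = {
    1: "Unrelated to the document or factually wrong.",
    5: "Covers the main topic but misses key points or is poorly structured.",
    10: "Accurate, complete, concise, and well structured.",
}


def build_judge_prompt(document: str, summary: str) -> str:
    """Build a prompt that spells out what each anchor score means."""
    criteria = "\n".join(f"- Score {score}: {text}" for score, text in sorted(RUBRIC.items()))
    return (
        "Rate the summary of the document on a scale from 1 to 10.\n"
        f"{criteria}\n"
        "Answer in the form 'Score: <number>'.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}\n"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the score from the judge's reply, or None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

Returning None for unparseable or out-of-range replies keeps bad judge outputs from silently skewing the average score.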

User feedback

User feedback is a great way of receiving quantitative metrics on your LLM’s outputs. User feedback can, for example, be a thumbs-up or thumbs-down button, stating if the generated summary is satisfactory. If you combine such feedback from hundreds or thousands of users, you have a reliable feedback mechanism you can utilize to vastly improve the performance of your LLM summary generator!

These users can be your customers, so you should make it easy for them to provide feedback and encourage them to provide as much of it as possible. However, these users can essentially be anyone who doesn’t utilize or develop your application on a day-to-day basis. It’s important to remember that any such feedback is incredibly valuable for improving the performance of your LLM, and it doesn’t really cost you (as the developer of the application) any time to gather.
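Turning thumbs-up and thumbs-down clicks into a metric can be as simple as the sketch below. Grouping the votes by prompt version is my own assumption about how you might want to slice the data; any other key, such as document type, works the same way.

```python
from collections import defaultdict


def approval_rates(feedback: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the share of thumbs-up votes per prompt version.

    feedback is a list of (prompt_version, thumbs_up) pairs.
    """
    ups = defaultdict(int)
    totals = defaultdict(int)
    for version, thumbs_up in feedback:
        totals[version] += 1
        if thumbs_up:
            ups[version] += 1
    return {version: ups[version] / totals[version] for version in totals}


votes = [("v1", True), ("v1", False), ("v2", True), ("v2", True)]
print(approval_rates(votes))
```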

Conclusion

In this article, I have discussed how you can perform large-scale validation and evaluation in your LLM application. Doing this is incredibly important to both ensure your application performs as expected and to improve your application based on user feedback. I recommend incorporating such validation and evaluation flows in your application as soon as possible, given the importance of ensuring that inherently unpredictable LLMs can reliably provide value in your application.

You can also read my articles on How to Benchmark LLMs with ARC AGI 3 and How to Effortlessly Extract Receipt Information with OCR and GPT-4o mini.
