How to Use LLMs for Powerful Automatic Evaluations

by AiNEWS2025
2025-08-14
in Machine Learning

In this article, I discuss how you can perform automatic evaluations using LLM as a judge. LLMs are widely used today for a variety of applications. However, an often underestimated use of LLMs is evaluation. With LLM as a judge, you use an LLM to judge the quality of an output, whether by giving it a score between 1 and 10, comparing two outputs, or providing pass/fail feedback. The goal of this article is to show how you can use LLM as a judge in your own application to make development more effective.

This infographic highlights the contents of my article. Image by ChatGPT.

You can also read my article on Benchmarking LLMs with ARC AGI 3 and check out my website, which contains all my information and articles.

Motivation

My motivation for writing this article is that I work daily on different LLM applications. The more I read about using an LLM as a judge, the more I came to believe that automated evaluation of machine-learning systems is a powerful use of LLMs that is often underestimated.

Using LLM as a judge can save you enormous amounts of time, considering it can automate either part of, or the whole, evaluation process. Evaluations are critical for machine-learning systems to ensure they perform as intended. However, evaluations are also time-consuming, and you thus want to automate them as much as possible.

One powerful use case for LLM as a judge is a question-answering system. You can gather a series of input-output examples for two different versions of a prompt. You can then ask the LLM judge whether the two outputs are equivalent or whether the newer prompt version produces the better output, and thus ensure that changes to your application do not hurt performance. This can, for example, be run before deploying new prompts.

Definition

I define LLM as a judge as any case where you prompt an LLM to evaluate the output of a system. The system is usually machine-learning-based, though this is not a requirement. You provide the LLM with a set of instructions on how to evaluate the system, including what is important for the evaluation and which evaluation metric to use. The judge's output can then be processed to either continue a deployment or stop it because the quality is deemed too low. This removes the time-consuming and inconsistent step of manually reviewing LLM outputs before making changes to your application.

LLM as a judge evaluation methods

LLM as a judge can be used for a variety of applications, such as:

  • Question answering systems
  • Classification systems
  • Information extraction systems
  • …

Different applications require different evaluation methods, so I describe three methods below.

Compare two outputs

Comparing two outputs is a great use of LLM as a judge. With this evaluation metric, you compare the output of two different models.

The difference between the models can, for example, be:

  • Different input prompts
  • Different LLMs (e.g., OpenAI GPT-4o vs. Claude Sonnet 4.0)
  • Different embedding models for RAG

You then provide the LLM judge with four items:

  • The input prompt(s)
  • Output from model 1
  • Output from model 2
  • Instructions on how to perform the evaluation

You can then ask the LLM judge to provide one of the three following outputs:

  • Equal (the essence of the outputs is the same)
  • Output 1 (the first model is better)
  • Output 2 (the second model is better).

You can, for example, use this in the scenario I described earlier, if you want to update the input prompt. You can then ensure that the updated prompt is equal to or better than the previous prompt. If the LLM judge informs you that all test samples are either equal or the new prompt is better, you can likely automatically deploy the updates.
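
The comparison flow described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `call_llm` is a hypothetical callable that sends a prompt to whichever LLM you use and returns its text reply, and the verdict labels `EQUAL`, `OUTPUT_1`, and `OUTPUT_2` are assumed names for the three outcomes.

```python
from typing import Callable, Iterable

# Assumed labels for the three possible judge outcomes.
VERDICTS = ("EQUAL", "OUTPUT_1", "OUTPUT_2")


def build_comparison_prompt(instructions: str, input_prompt: str,
                            output_1: str, output_2: str) -> str:
    """Assemble the four items the judge needs into one prompt."""
    return (
        f"{instructions}\n\n"
        f"Input prompt:\n{input_prompt}\n\n"
        f"Output 1:\n{output_1}\n\n"
        f"Output 2:\n{output_2}\n\n"
        "Reply with exactly one of: EQUAL, OUTPUT_1, OUTPUT_2."
    )


def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply and reject anything unexpected."""
    verdict = raw.strip().upper()
    if verdict not in VERDICTS:
        raise ValueError(f"Unexpected judge reply: {raw!r}")
    return verdict


def new_prompt_is_safe(samples: Iterable[tuple[str, str, str]],
                       call_llm: Callable[[str], str],
                       instructions: str) -> bool:
    """Return True only if no sample is judged better under the old prompt.

    Each sample is (input_prompt, old_output, new_output); the old output
    is always shown to the judge as Output 1.
    """
    for input_prompt, old_output, new_output in samples:
        prompt = build_comparison_prompt(instructions, input_prompt,
                                         old_output, new_output)
        if parse_verdict(call_llm(prompt)) == "OUTPUT_1":
            return False
    return True
```

Raising on an unexpected reply, instead of guessing, keeps a malformed judge response from silently counting as a pass in a deployment gate.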

Score outputs

Another evaluation method you can use with LLM as a judge is to assign the output a score, for example, between 1 and 10. In this scenario, you need to provide the LLM judge with the following:

  • Instructions for performing the evaluation
  • The input prompt
  • The output

In this evaluation method, it's critical to provide clear instructions to the LLM judge, considering that providing a score is a subjective task. I strongly recommend providing examples of outputs that deserve a score of 1, a score of 5, and a score of 10. These give the model anchors it can use to score more accurately. You can also try using fewer possible scores, for example, only 1, 2, and 3. Fewer options tend to make the judge more accurate, at the cost of granularity: smaller differences between outputs become harder to tell apart.

The scoring evaluation metric is useful for running larger experiments, comparing different prompt versions, models, and so on. You can then utilize the average score over a larger test set to accurately judge which approach works best.
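
As a rough sketch of the scoring method, the snippet below builds a prompt with anchor examples, parses the score, and averages it over a test set. The function names, the prompt wording, and `call_llm` (a hypothetical callable that returns the judge's text reply) are assumptions made for illustration.

```python
import re
from statistics import mean
from typing import Callable


def build_scoring_prompt(instructions: str, anchors: dict[int, str],
                         input_prompt: str, output: str) -> str:
    """Combine instructions, anchor examples, and the sample to score."""
    anchor_lines = "\n".join(
        f"Example of a score of {score}: {example}"
        for score, example in sorted(anchors.items())
    )
    return (
        f"{instructions}\n\n{anchor_lines}\n\n"
        f"Input prompt:\n{input_prompt}\n\n"
        f"Output to score:\n{output}\n\n"
        "Reply with a single integer from 1 to 10."
    )


def parse_score(raw: str, low: int = 1, high: int = 10) -> int:
    """Extract the first integer from the reply and range-check it."""
    match = re.search(r"\d+", raw)
    if match is None:
        raise ValueError(f"No score found in judge reply: {raw!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"Score {score} is outside {low}-{high}")
    return score


def average_score(samples: list[tuple[str, str]],
                  call_llm: Callable[[str], str],
                  instructions: str,
                  anchors: dict[int, str]) -> float:
    """Average the judge's score over (input_prompt, output) pairs."""
    scores = [
        parse_score(call_llm(build_scoring_prompt(instructions, anchors,
                                                  input_prompt, output)))
        for input_prompt, output in samples
    ]
    return mean(scores)
```

To compare two prompt versions, you would run `average_score` once per version on the same test inputs and compare the two averages.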

Pass/fail

Pass or fail is another common evaluation metric for LLM as a judge. In this scenario, you ask the LLM judge to either approve or disapprove the output, given a description of what constitutes a pass and what constitutes a fail. Similar to the scoring evaluation, this description is critical to the performance of the LLM judge. Again, I recommend using examples, essentially utilizing few-shot learning to make the LLM judge more accurate. You can read more about few-shot learning in my article on context engineering.

The pass/fail evaluation metric is useful for RAG systems, to judge whether a model answered a question correctly. You can, for example, provide the fetched chunks and the model's output so the judge can determine whether the RAG system answered correctly.
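
A pass/fail judge for a RAG system might look roughly like the sketch below. It assumes a hypothetical `call_llm` callable that returns the judge's text reply, and the prompt layout and few-shot format are illustrative choices rather than a prescribed template.

```python
from typing import Callable


def build_pass_fail_prompt(criteria: str, examples: list[tuple[str, str]],
                           chunks: list[str], question: str,
                           answer: str) -> str:
    """Build a few-shot pass/fail prompt for one RAG answer.

    `examples` holds (answer, verdict) pairs that illustrate the criteria.
    """
    shots = "\n".join(f"Answer: {a}\nVerdict: {v}" for a, v in examples)
    context = "\n---\n".join(chunks)
    return (
        f"{criteria}\n\nExamples:\n{shots}\n\n"
        f"Retrieved chunks:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Verdict (PASS or FAIL):"
    )


def parse_pass_fail(raw: str) -> bool:
    """Map the judge's reply to True (pass) or False (fail)."""
    verdict = raw.strip().upper()
    if verdict.startswith("PASS"):
        return True
    if verdict.startswith("FAIL"):
        return False
    raise ValueError(f"Unexpected judge reply: {raw!r}")


def pass_rate(cases: list[tuple[list[str], str, str]],
              call_llm: Callable[[str], str],
              criteria: str,
              examples: list[tuple[str, str]]) -> float:
    """Fraction of (chunks, question, answer) cases the judge passes."""
    verdicts = [
        parse_pass_fail(call_llm(build_pass_fail_prompt(
            criteria, examples, chunks, question, answer)))
        for chunks, question, answer in cases
    ]
    return sum(verdicts) / len(verdicts)
```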

Important notes

Compare with a human evaluator

I also have a few important notes on LLM as a judge from working with it myself. The most important lesson is that while an LLM-as-a-judge system can save you large amounts of time, it can also be unreliable. When implementing the LLM judge, you thus need to test the system manually, ensuring that it responds similarly to a human evaluator. This should preferably be done as a blind test. For example, you can set up a series of pass/fail examples and see how often the LLM judge agrees with the human evaluator.
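
One simple way to quantify this agreement is the fraction of examples where the judge and the human give the same pass/fail label. The sketch below assumes both sets of labels were collected independently and are stored as booleans in the same order; the function names are illustrative.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where judge and human gave the same label."""
    if len(human) != len(judge):
        raise ValueError("Label lists must have the same length")
    if not human:
        raise ValueError("Need at least one labeled example")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def disagreement_indices(human: list[bool], judge: list[bool]) -> list[int]:
    """Indices of examples worth reviewing by hand."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


# Example: the judge disagrees with the human on one of five examples.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, False, True]
rate = agreement_rate(human_labels, judge_labels)
to_review = disagreement_indices(human_labels, judge_labels)
```

Reading through the disagreements is often more useful than the rate itself, since they show where the judge instructions are ambiguous.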

Cost

Another important note to keep in mind is the cost. The cost of LLM requests is trending downwards, but when developing an LLM-as-a-judge system, you are also performing a lot of requests. I would thus keep this in mind and estimate the cost of the system. For example, if each LLM-as-a-judge run costs 10 USD and you perform five such runs a day on average, you incur a cost of 50 USD per day. You may need to evaluate whether this is an acceptable price for more effective development, or whether you should reduce the cost of the LLM-as-a-judge system. You can, for example, reduce the cost by using cheaper models (GPT-4o-mini instead of GPT-4o) or by reducing the number of test examples.
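
The estimate above is simple arithmetic, and a small helper makes it easy to see how model price and test-set size affect it. In this sketch, the token counts and per-million-token prices are placeholder parameters you would fill in from your own usage and your provider's current pricing; no real prices are assumed.

```python
def cost_per_run_usd(n_examples: int,
                     input_tokens_per_example: int,
                     output_tokens_per_example: int,
                     usd_per_million_input: float,
                     usd_per_million_output: float) -> float:
    """Estimated cost of one judge run over the whole test set."""
    per_example = (
        input_tokens_per_example * usd_per_million_input
        + output_tokens_per_example * usd_per_million_output
    ) / 1_000_000
    return n_examples * per_example


def daily_cost_usd(cost_per_run: float, runs_per_day: float) -> float:
    """Daily cost given the cost of one run and the runs per day."""
    return cost_per_run * runs_per_day


# The example from the text: 10 USD per run, five runs a day.
print(daily_cost_usd(10, 5))  # 50
```

Because the per-run cost scales linearly with the number of test examples, halving the test set halves the cost, and switching to a cheaper model scales it by the ratio of the prices.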

Conclusion

In this article, I have discussed how LLM as a judge works and how you can use it to make development more effective. LLM as a judge is an often overlooked use of LLMs that can be incredibly powerful, for example, before deployments, to ensure your question-answering system still works on historic queries.

I discussed different evaluation methods, along with how and when you should use them. LLM as a judge is a flexible system, and you need to adapt it to whichever scenario you are implementing. Lastly, I also discussed some important notes, for example, comparing the LLM judge with a human evaluator.
