How to Use LLMs for Powerful Automatic Evaluations

In this article, I discuss how you can perform automatic evaluations using LLM as a judge. LLMs are widely used today for a variety of applications. However, an often underestimated use of LLMs is evaluation. With LLM as a judge, you use an LLM to judge the quality of an output, whether by giving it a score between 1 and 10, comparing two outputs, or providing pass/fail feedback. The goal of the article is to show how you can use LLM as a judge in your own application to make development more effective.

This infographic highlights the contents of my article. Image by ChatGPT.

You can also read my article on Benchmarking LLMs with ARC AGI 3 and check out my website, which contains all my information and articles.

Motivation

My motivation for writing this article is that I work daily on different LLM applications. I kept coming across LLM as a judge and decided to read up on the topic. I believe using LLMs for automated evaluation of machine-learning systems is a powerful capability that’s often underestimated.

Using LLM as a judge can save you enormous amounts of time, considering it can automate either part of, or the whole, evaluation process. Evaluations are critical for machine-learning systems to ensure they perform as intended. However, evaluations are also time-consuming, and you thus want to automate them as much as possible.

One powerful example use case for LLM as a judge is a question-answering system. You can gather a series of input-output examples for two different versions of a prompt. Then you can ask the LLM judge whether the two outputs are equal, or whether the output from the newer prompt version is better, and thus ensure changes in your application do not have a negative impact on performance. This can, for example, be used as a check before deploying new prompts.

Definition

I define LLM as a judge as any case where you prompt an LLM to evaluate the output of a system. The system is primarily machine-learning-based, though this is not a requirement. You simply provide the LLM with a set of instructions on how to evaluate the system, including what’s important for the evaluation and which evaluation metric should be used. The judge’s output can then be processed to either continue the deployment or stop it because quality has dropped. This removes much of the time-consuming and inconsistent step of manually reviewing LLM outputs before making changes to your application.

LLM as a judge evaluation methods

LLM as a judge can be used for a variety of applications, such as:

Question answering systems
Classification systems
Information extraction systems
…

Different applications will require different evaluation methods, so I will describe three different methods below:

Compare two outputs

Comparing two outputs is a great use of LLM as a judge. With this evaluation method, you compare the outputs of two different versions of a system, which I refer to as models below.

The difference between the models can, for example, be:

Different input prompts
Different LLMs (e.g., OpenAI GPT-4o vs Claude Sonnet 4.0)
Different embedding models for RAG

You then provide the LLM judge with four items:

The input prompt(s)
Output from model 1
Output from model 2
Instructions on how to perform the evaluation

You can then ask the LLM judge to provide one of the three following outputs:

Equal (the essence of the outputs is the same)
Output 1 (the first model is better)
Output 2 (the second model is better)

You can, for example, use this in the scenario I described earlier, if you want to update the input prompt. You can then ensure that the updated prompt is equal to or better than the previous prompt. If the LLM judge informs you that all test samples are either equal or the new prompt is better, you can likely automatically deploy the updates.
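The pairwise comparison above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `judge` is a hypothetical callable that sends a prompt to whichever LLM you use and returns its text reply, and the prompt wording, the verdict labels, and the function names are all assumptions made for the example.

```python
from typing import Callable, Iterable

VERDICTS = ("EQUAL", "OUTPUT_1", "OUTPUT_2")

# Assumed prompt wording; adapt it to your own application.
PAIRWISE_TEMPLATE = """You are judging two outputs produced for the same input.

Evaluation instructions:
{instructions}

Input prompt:
{prompt}

Output 1:
{output_1}

Output 2:
{output_2}

Reply with exactly one of: EQUAL, OUTPUT_1, OUTPUT_2."""


def compare_outputs(
    judge: Callable[[str], str],  # hypothetical: prompt in, LLM text reply out
    prompt: str,
    output_1: str,
    output_2: str,
    instructions: str,
) -> str:
    """Ask the judge which output is better and return a validated verdict."""
    reply = judge(
        PAIRWISE_TEMPLATE.format(
            instructions=instructions,
            prompt=prompt,
            output_1=output_1,
            output_2=output_2,
        )
    )
    verdict = reply.strip().upper()
    if verdict not in VERDICTS:
        # Fail loudly rather than guess what an off-format reply meant.
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return verdict


def new_version_is_safe(verdicts: Iterable[str]) -> bool:
    """True if output 2 (the new version) never lost to output 1."""
    return all(v in ("EQUAL", "OUTPUT_2") for v in verdicts)
```

Here output 1 is assumed to come from the current prompt and output 2 from the updated one, so `new_version_is_safe` returning `True` over the whole test set corresponds to the case where the update can likely be deployed.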

Score outputs

Another evaluation method you can use for LLM as a judge is to assign the output a score, for example, between 1 and 10. In this scenario, you need to provide the LLM judge with the following:

Instructions for performing the evaluation
The input prompt
The output

In this evaluation method, it’s critical to provide clear instructions to the LLM judge, considering that providing a score is a subjective task. I strongly recommend providing examples of outputs that correspond to a score of 1, a score of 5, and a score of 10. This gives the model anchors it can use to provide a more accurate score. You can also try using fewer possible scores, for example, only scores of 1, 2, and 3. Fewer options tend to make the judge more accurate, at the cost of granularity: smaller differences between outputs become harder to tell apart.

The scoring evaluation metric is useful for running larger experiments, comparing different prompt versions, models, and so on. You can then utilize the average score over a larger test set to accurately judge which approach works best.
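A minimal sketch of score-based judging with anchor examples, under stated assumptions: `judge` is a hypothetical callable that takes a prompt and returns the LLM's text reply, `anchors` maps a score to an example output for that score, and the prompt wording and function names are illustrative only.

```python
import re
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Assumed prompt wording; adapt it to your own application.
SCORE_TEMPLATE = """Score the output below from {low} to {high}.

Evaluation instructions:
{instructions}

Reference examples (anchors):
{anchors}

Input prompt:
{prompt}

Output to score:
{output}

Reply with the integer score only."""


def build_score_prompt(
    instructions: str,
    anchors: Dict[int, str],  # e.g. {1: "...", 5: "...", 10: "..."}
    prompt: str,
    output: str,
    low: int = 1,
    high: int = 10,
) -> str:
    """Fill the template, listing anchor examples from lowest to highest score."""
    anchor_text = "\n".join(
        f"Score {score}: {example}" for score, example in sorted(anchors.items())
    )
    return SCORE_TEMPLATE.format(
        low=low, high=high, instructions=instructions,
        anchors=anchor_text, prompt=prompt, output=output,
    )


def parse_score(reply: str, low: int = 1, high: int = 10) -> int:
    """Take the first integer in the reply and check it is within range."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"Score {score} is outside {low}-{high}")
    return score


def average_score(
    judge: Callable[[str], str],  # hypothetical: prompt in, LLM text reply out
    samples: List[Tuple[str, str]],  # (input prompt, output) pairs
    instructions: str,
    anchors: Dict[int, str],
) -> float:
    """Score every sample and return the mean over the test set."""
    scores = [
        parse_score(judge(build_score_prompt(instructions, anchors, p, o)))
        for p, o in samples
    ]
    return mean(scores)
```

Running `average_score` once per prompt version or model on the same test set gives the averages you would compare. To try a coarser scale, pass different `low` and `high` values and a matching set of anchors.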

Pass/fail

Pass or fail is another common evaluation metric for LLM as a judge. In this scenario, you ask the LLM judge to either approve or disapprove the output, given a description of what constitutes a pass and what constitutes a fail. Similar to the scoring evaluation, this description is critical to the performance of the LLM judge. Again, I recommend using examples, essentially utilizing few-shot learning to make the LLM judge more accurate. You can read more about few-shot learning in my article on context engineering.

The pass/fail evaluation metric is useful for RAG systems to judge whether a model correctly answered a question. You can, for example, provide the question, the fetched chunks, and the output of the model to determine whether the RAG system answers correctly.
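A pass/fail judge for a RAG system could be sketched like this. It is an illustration under assumptions: the prompt wording, the `PASS`/`FAIL` labels, and the function names are made up for the example, and the few-shot examples are simplified to question, answer, and verdict without their own retrieved context.

```python
from typing import List, Tuple


def build_pass_fail_prompt(
    question: str,
    chunks: List[str],  # the fetched chunks the RAG system used
    answer: str,
    examples: List[Tuple[str, str, str]],  # (question, answer, "PASS" or "FAIL")
) -> str:
    """Build a few-shot prompt asking for a PASS or FAIL verdict."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\nVerdict: {v}" for q, a, v in examples
    )
    context = "\n---\n".join(chunks)
    return (
        "Decide whether the answer is correct and supported by the retrieved context.\n"
        "Reply with PASS or FAIL only.\n\n"
        f"Examples:\n{shots}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\nVerdict:"
    )


def parse_pass_fail(reply: str) -> bool:
    """Map the judge's reply to True (pass) or False (fail)."""
    verdict = reply.strip().upper()
    if verdict.startswith("PASS"):
        return True
    if verdict.startswith("FAIL"):
        return False
    # Off-format replies are surfaced instead of being silently counted.
    raise ValueError(f"Unexpected judge reply: {reply!r}")
```

You would send the string from `build_pass_fail_prompt` to your LLM of choice and feed its reply to `parse_pass_fail`.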

Important notes

Compare with a human evaluator

I also have a few important notes regarding LLM as a judge, from working on it myself. The number one learning is that while an LLM as a judge system can save you large amounts of time, it can also be unreliable. When implementing the LLM judge, you thus need to test the system manually, ensuring the judge responds similarly to a human evaluator. This should preferably be performed as a blind test. For example, you can set up a series of pass/fail examples and see how often the LLM judge agrees with the human evaluator.
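Measuring how often the judge agrees with a human can be as simple as the sketch below. The function names are hypothetical, and the labels are assumed to be pass/fail booleans collected for the same examples in the same order.

```python
from typing import List, Sequence


def agreement_rate(human_labels: Sequence[bool], judge_labels: Sequence[bool]) -> float:
    """Fraction of examples where the LLM judge matches the human evaluator."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("Both label lists must cover the same examples")
    if len(human_labels) == 0:
        raise ValueError("Need at least one labeled example")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels: Sequence[bool], judge_labels: Sequence[bool]) -> List[int]:
    """Indices of examples worth reviewing by hand."""
    return [
        i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
    ]
```

Reviewing the examples returned by `disagreements` is a practical way to find out where the judge instructions need to be tightened.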

Cost

Another important note to keep in mind is the cost. The cost of LLM requests is trending downwards, but when developing an LLM as a judge system, you are also performing a lot of requests. I would thus estimate the cost of the system up front. For example, if each LLM as a judge run costs 10 USD, and you, on average, perform five such runs a day, you incur a cost of 50 USD per day. You may need to evaluate whether this is an acceptable price for more effective development, or whether you should reduce the cost of the LLM as a judge system. You can, for example, reduce the cost by using cheaper models (GPT-4o-mini instead of GPT-4o) or by reducing the number of test examples.
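The cost estimate above is simple arithmetic, sketched here. The function names and the token-based breakdown are assumptions for illustration; plug in the token counts you observe and the current prices from your provider, since no real prices are given here.

```python
def cost_per_run(
    num_examples: int,
    input_tokens_per_example: int,
    output_tokens_per_example: int,
    usd_per_million_input_tokens: float,   # look up your provider's current price
    usd_per_million_output_tokens: float,  # look up your provider's current price
) -> float:
    """Estimated USD cost of judging the whole test set once."""
    input_cost = (
        num_examples * input_tokens_per_example / 1_000_000
        * usd_per_million_input_tokens
    )
    output_cost = (
        num_examples * output_tokens_per_example / 1_000_000
        * usd_per_million_output_tokens
    )
    return input_cost + output_cost


def daily_cost(cost_per_run_usd: float, runs_per_day: float) -> float:
    """Estimated USD cost per day for a given number of judge runs."""
    return cost_per_run_usd * runs_per_day


# The example from the text: 10 USD per run, five runs a day.
print(daily_cost(10.0, 5))  # 50.0
```

Both levers mentioned in the text show up directly as parameters: a cheaper model lowers the per-token prices, and a smaller test set lowers `num_examples`.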

Conclusion

In this article, I have discussed how LLM as a judge works and how you can utilize it to make development more effective. LLM as a judge is an often overlooked use of LLMs that can be incredibly powerful, for example, as a pre-deployment check to ensure your question-answering system still works on historic queries.

I discussed different evaluation methods, along with how and when you should utilize them. LLM as a judge is a flexible system, and you need to adapt it to whichever scenario you are implementing. Lastly, I also discussed some important notes, for example, comparing the LLM judge with a human evaluator.
