How to Analyze and Optimize Your LLMs in 3 Steps

You have an LLM in production, actively responding to user queries. Now you want to improve the model so it handles a larger fraction of customer requests successfully. How do you approach this?

In this article, I discuss the scenario where you already have a running LLM and want to analyze and optimize its performance. I cover the approaches I use to uncover where the LLM works well and where it needs improvement, as well as the tools I use to improve its performance, such as Anthropic’s prompt improver.

In short, I follow a three-step process to quickly improve my LLM’s performance:

Analyze LLM outputs
Iteratively improve areas with the most value to effort
Evaluate and iterate

Motivation

My motivation for this article is that I often find myself in the scenario described in the intro. I already have my LLM up and running; however, it’s not performing as expected or meeting customer expectations. After analyzing my LLMs many times, I have settled on this simple three-step process, which I now always use to improve them.

Step 1: Analyzing LLM outputs

The first step to improving your LLMs should always be to analyze their output. To have high observability in your platform, I strongly recommend using an LLM manager tool for tracing, such as Langfuse or PromptLayer. These tools make it simple to gather all your LLM invocations in one place, ready for analysis.

I’ll now discuss some different approaches I apply to analyze my LLM outputs.

Manual inspection of raw output

The simplest approach to analyze your LLM output is to manually inspect many of your LLM invocations. You should gather your last 50 LLM invocations, read through the entire context you fed into the model, and the output the model provided. I find this approach surprisingly effective in uncovering problems. I have, for example, discovered:

Duplicate context (part of my context was duplicated due to a programming error)
Missing context (I wasn’t feeding all the information I expected into my LLM)
etc.
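Some of these problems can also be flagged automatically once you know what to look for. The following is a minimal sketch for the duplicate-context case, not a replacement for reading the traces. It assumes your context is a single string whose sections are separated by blank lines; the function name `find_duplicate_chunks` and the `min_chars` threshold are hypothetical choices for illustration.

```python
from collections import Counter


def find_duplicate_chunks(context: str, min_chars: int = 40) -> list[str]:
    """Return the paragraphs that appear more than once in a context string."""
    # Split on blank lines and collapse whitespace, so trivial formatting
    # differences do not hide a duplicated paragraph.
    chunks = [" ".join(part.split()) for part in context.split("\n\n")]
    # Ignore very short chunks (headings, separators) that repeat legitimately.
    counts = Counter(chunk for chunk in chunks if len(chunk) >= min_chars)
    return [chunk for chunk, n in counts.items() if n > 1]
```

Running this over the contexts of your most recent invocations gives you a short list of traces worth opening first.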

Manual inspection of data should never be underestimated. Thoroughly looking through the data manually gives you an understanding of the dataset you are working on, which is hard to obtain in any other manner. I also find that I should manually inspect more data points than I initially feel like spending time on.

For example, let’s say it takes 5 minutes to manually inspect one input-output example. My intuition often tells me to maybe spend 20-30 minutes on this, and thus inspect 4-6 data points. However, I find that you should usually spend a lot longer on this part of the process. I recommend at least 5x-ing this time, so instead of spending 30 minutes manually inspecting, you spend 2.5 hours. Initially, you’ll think this is a lot of time to spend on manual inspection, but you’ll usually find it saves you plenty of time in the long run. Additionally, compared to an entire 3-week project, 2.5 hours is an insignificant amount of time.

Group queries according to taxonomy

Sometimes, manual analysis alone won’t give you all the answers. In those instances, I move on to a more quantitative analysis of my data, in contrast to the first approach, which I consider qualitative since I’m manually inspecting each data point.

Grouping user queries according to a taxonomy is an efficient approach to better understand what users expect from your LLM. I’ll provide an example to make this easier to understand:

Imagine you’re Amazon, and you have a customer service LLM handling incoming customer questions. In this instance, a taxonomy will look something like:

Refund requests
Talk to a human requests
Questions about individual products
…

I would then look at the last 1000 user queries and manually annotate them into this taxonomy. This will tell you which questions are most prevalent, and which ones you should focus most on answering correctly. You’ll often find that the distribution of items in each category will follow a Pareto distribution, with most items belonging to a few specific categories.

Additionally, you annotate whether a customer request was successfully answered or not. With this information, you can now discover what kinds of questions you’re struggling with and which ones your LLM is good at. Maybe the LLM easily transfers customer queries to humans when requested; however, it struggles when queried about details about a product. In this instance, you should focus your effort on improving the group of questions you’re struggling with the most.
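Once the annotations exist, the analysis itself is a few lines of code. The sketch below assumes each annotated query is stored as a `(category, was_successful)` pair; that format and the function name `summarize_by_category` are my own assumptions for illustration, so adapt them to however you store your annotations.

```python
from collections import defaultdict


def summarize_by_category(annotations):
    """Summarize annotated queries given as (category, was_successful) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, was_successful in annotations:
        totals[category] += 1
        successes[category] += int(was_successful)

    n_queries = sum(totals.values())
    rows = [
        {
            "category": category,
            "share": totals[category] / n_queries,  # how prevalent it is
            "success_rate": successes[category] / totals[category],
            "failures": totals[category] - successes[category],
        }
        for category in totals
    ]
    # Put the categories with the most failed requests first.
    return sorted(rows, key=lambda row: row["failures"], reverse=True)
```

Sorting by the absolute number of failures rather than by success rate is a deliberate choice here: a common category with a mediocre success rate can affect more customers than a rare category with a poor one.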

LLM as a judge on a golden dataset

Another quantitative approach I use to analyze my LLM outputs is to create a golden dataset of input-output examples and utilize LLM as a judge. This will help when you make changes to your LLM.

Continuing on the customer support example from previously, you can create a list of 50 (real) user queries and the desired response from each of them. Whenever you make changes to your LLM (change model version, add more context, …), you can automatically test the new LLM on the golden dataset, and have an LLM as a judge determine if the response from the new model is at least as good as the response from the old model. This will save you vast amounts of time manually inspecting LLM outputs whenever you update your LLM.
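A regression check like this can be sketched as a small harness. Everything here is an assumption for illustration: `GoldenExample`, `run_golden_eval`, and the two callables are hypothetical names, `generate` stands in for a call to the LLM under test, and `judge` stands in for a call to a judge LLM that returns True when the candidate is acceptable. In this sketch the judge compares the candidate against the desired response; comparing against the old model’s response instead only requires a different `judge`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    query: str             # a real user query
    desired_response: str  # the response you want for that query


def run_golden_eval(
    examples: list[GoldenExample],
    generate: Callable[[str], str],          # hypothetical: calls the LLM under test
    judge: Callable[[str, str, str], bool],  # hypothetical: calls the judge LLM
) -> tuple[float, list[tuple[GoldenExample, str]]]:
    """Run the model on every golden example and let the judge grade it.

    Returns the pass rate and the failed (example, candidate) pairs.
    """
    failures = []
    for example in examples:
        candidate = generate(example.query)
        if not judge(example.query, example.desired_response, candidate):
            failures.append((example, candidate))
    pass_rate = 1 - len(failures) / len(examples) if examples else 0.0
    return pass_rate, failures
```

Keeping the model and the judge as plain callables means the harness does not depend on any particular provider, and you can unit test it with simple stub functions before wiring in real LLM calls. The returned failures are also a convenient starting point for another round of manual inspection.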

If you want to learn more about LLM as a judge, you can read my TDS article on the topic.

Step 2: Iteratively improving your LLM

You’re done with step one, and you now want to use those insights to improve your LLM. In this section, I discuss how I approach this step to efficiently improve the performance of my LLM.

If I discover significant issues, for example, when manually inspecting data, I always fix those first. Examples include unnecessary noise being added to the LLM’s context, or typos in my prompts. Once those are fixed, I move on to dedicated tools.

One type of tool I use is a prompt optimizer, such as Anthropic’s prompt improver. With these tools, you typically input your prompt and some input-output examples. You can, for example, input the prompt you use for your customer service agents, along with examples of customer interactions where the LLM failed. The prompt optimizer will analyze your prompt and examples and return an improved version of your prompt. You’ll likely see improvements such as:

Improved structure in your prompt, for example, using Markdown
Handling of edge cases. For example, handling cases where the user queries the customer support agent about completely unrelated topics, such as asking “What is the weather in New York today?”. The prompt optimizer might add something like “If the question is not related to Amazon, tell the user that you’re only designed to answer questions about Amazon”.

If I have more quantitative data, such as from grouping user queries or a golden dataset, I also analyze this data and create a value-effort graph. The value-effort graph highlights the different available improvements you can make, such as:

Improved edge case handling in the system prompt
Use a better embedding model for improved RAG

You then plot these items in a 2D grid, with value on the vertical axis and effort on the horizontal axis, as in the figure below. You should naturally prioritize items in the upper left quadrant because they provide a lot of value and require little effort. In practice, however, items tend to fall along a diagonal, where higher value correlates strongly with higher required effort.

This figure shows a value effort graph. The value effort graph displays different improvements you can make to your product. The improvements are displayed in the graph according to how valuable they are and the effort required to build them. Image by ChatGPT.

I put all my improvement suggestions into a value-effort graph, and then gradually pick items that are as high as possible in value, and as low as possible in effort. This is a super effective approach to quickly solve the most pressing issues with your LLM, positively impacting the largest number of customers you can for a given amount of effort.
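If you prefer a ranked list over a plot, the same prioritization can be approximated by sorting on the ratio of value to effort. This is a rough sketch under my own assumptions: the `Improvement` class, the `prioritize` function, and the idea of scoring both dimensions on a simple numeric scale are illustrative, and the scores themselves remain your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Improvement:
    name: str
    value: float   # estimated benefit, e.g. 1 (low) to 10 (high)
    effort: float  # estimated cost on the same scale; must be > 0


def prioritize(items: list[Improvement]) -> list[Improvement]:
    """Order improvements from highest to lowest value per unit of effort."""
    return sorted(items, key=lambda item: item.value / item.effort, reverse=True)


# Hypothetical scores for the two example improvements mentioned above.
backlog = [
    Improvement("Use a better embedding model for improved RAG", value=8, effort=8),
    Improvement("Improved edge case handling in the system prompt", value=6, effort=2),
]
for item in prioritize(backlog):
    print(f"{item.value / item.effort:.2f}  {item.name}")
```

A single ratio hides some information the graph shows, for example that a high-value, high-effort item may still be worth scheduling, so I would treat the ranking as a starting point rather than a final plan.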

Step 3: Evaluate and iterate

The last step in my three-step process is to evaluate my LLM and iterate. There are a plethora of techniques you can use to evaluate your LLM, a lot of which I cover in my article on the topic.

Preferably, you create some quantitative metrics for your LLM’s performance, and ensure those metrics have improved from the changes you applied in step 2. After applying these changes and verifying they improved your LLM, you should consider whether the model is good enough or if you should continue improving it. I most often operate on the 80% principle, which states that 80% performance is good enough in almost all cases. This is not a literal 80% accuracy. Rather, it highlights the point that you don’t need a perfect model, only one that is good enough.
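Checking that the metrics actually improved can be as simple as comparing two runs. The sketch below assumes you have one dictionary of metric values from before the change and one from after, for example the success rate per taxonomy category or the pass rate on the golden dataset, and that higher is better for every metric. The function names and the `tolerance` parameter are hypothetical.

```python
def compare_metrics(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return the change (after - before) for every metric present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}


def find_regressions(deltas: dict[str, float], tolerance: float = 0.0) -> list[str]:
    """List metrics that dropped by more than the tolerance (higher is better)."""
    return [name for name, delta in deltas.items() if delta < -tolerance]
```

Looking at regressions separately matters because an overall improvement can hide a category that got worse, which is exactly the kind of issue the per-category analysis in step 1 is meant to surface.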

Conclusion

In this article, I have discussed the scenario where you already have an LLM in production, and you want to analyze and improve your LLM. I approach this scenario by first analyzing the model inputs and outputs, preferably by full manual inspection. After ensuring I really understand the dataset and how the model behaves, I also move into more quantitative metrics, such as grouping queries into a taxonomy and using LLM as a judge. Following this, I implement improvements based on my findings in the previous step, and lastly, I evaluate whether my improvements worked as intended.
