How to Analyze and Optimize Your LLMs in 3 Steps

by AiNEWS2025
2025-09-12


You have an LLM in production, actively responding to user queries. However, you now want to improve your model so that it handles a larger fraction of customer requests successfully. How do you approach this?

In this article, I discuss the scenario where you already have a running LLM and want to analyze and optimize its performance. I cover the approaches I use to uncover where the LLM works well and where it needs improvement, as well as the tools I use to improve its performance, such as Anthropic’s prompt improver.

In short, I follow a three-step process to quickly improve my LLM’s performance:

  1. Analyze LLM outputs
  2. Iteratively improve the areas with the best value-to-effort ratio
  3. Evaluate and iterate

Motivation

My motivation for this article is that I often find myself in the scenario described in the intro. I already have my LLM up and running; however, it’s not performing as expected or meeting customer expectations. After analyzing many of my LLMs, I have settled on this simple three-step process, which I always use to improve them.

Step 1: Analyzing LLM outputs

The first step to improving your LLMs should always be to analyze their output. To have high observability in your platform, I strongly recommend using an LLM manager tool for tracing, such as Langfuse or PromptLayer. These tools make it simple to gather all your LLM invocations in one place, ready for analysis.
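
If a tracing tool is not yet in place, the idea of gathering every invocation in one place can be sketched with the standard library alone. This is not the Langfuse or PromptLayer API; it is a stand-in that records the same kind of data, and the names `traced`, `load_last`, and the JSONL file layout are assumptions made for illustration.

```python
import json
import time
from typing import Callable


def traced(llm_call: Callable[[str], str], log_path: str) -> Callable[[str], str]:
    """Wrap an LLM call so every invocation is appended to a JSONL file."""

    def wrapper(prompt: str) -> str:
        start = time.time()
        output = llm_call(prompt)
        record = {
            "timestamp": start,
            "latency_s": round(time.time() - start, 3),
            "prompt": prompt,   # the full context sent to the model
            "output": output,   # the raw model response
        }
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return output

    return wrapper


def load_last(log_path: str, n: int = 50) -> list[dict]:
    """Return the n most recent logged invocations, oldest first."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f][-n:]
```

Wrapping the production call once, for example `ask = traced(my_llm_call, "trace.jsonl")` where `my_llm_call` is your own function, is enough to start collecting data for the analysis steps that follow.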

I’ll now discuss some different approaches I apply to analyze my LLM outputs.

Manual inspection of raw output

The simplest approach to analyze your LLM output is to manually inspect many of your LLM invocations. You should gather your last 50 LLM invocations, read through the entire context you fed into the model, and the output the model provided. I find this approach surprisingly effective in uncovering problems. I have, for example, discovered:

  • Duplicate context (part of my context was duplicated due to a programming error)
  • Missing context (I wasn’t feeding all the information I expected into my LLM)
  • etc.
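
Problems like the two above are found by reading, but once you know what to look for, a small check can flag them across many invocations. The following is a minimal sketch, assuming the context is a single string whose sections are separated by blank lines; the function names, the section markers, and the sample context are all made up for illustration.

```python
from collections import Counter


def duplicated_chunks(context: str, min_chars: int = 40) -> list[str]:
    """Return paragraphs that appear more than once in a context string."""
    paragraphs = [p.strip() for p in context.split("\n\n")]
    # Ignore very short paragraphs, which repeat legitimately (headers, separators).
    counts = Counter(p for p in paragraphs if len(p) >= min_chars)
    return [p for p, n in counts.items() if n > 1]


def missing_fields(context: str, required_markers: list[str]) -> list[str]:
    """Return the expected section markers that are absent from the context."""
    return [m for m in required_markers if m not in context]


# Hypothetical context for one customer service invocation.
context = "\n\n".join([
    "ORDER HISTORY: order 1234, delivered 2025-09-01, one pair of headphones.",
    "RETURN POLICY: items can be returned within 30 days of delivery.",
    "ORDER HISTORY: order 1234, delivered 2025-09-01, one pair of headphones.",
])

print(duplicated_chunks(context))
print(missing_fields(context, ["ORDER HISTORY:", "RETURN POLICY:", "CUSTOMER PROFILE:"]))
```

A check like this complements manual reading rather than replacing it, since it only catches the failure modes you have already seen once.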

Manual inspection of data should never be underestimated. Thoroughly looking through the data manually gives you an understanding of the dataset you are working on, which is hard to obtain in any other manner. I also find that I should inspect more data points than I initially feel like spending time on.

For example, let’s say it takes 5 minutes to manually inspect one input-output example. My intuition often tells me to maybe spend 20-30 minutes on this, and thus inspect 4-6 data points. However, I find that you should usually spend a lot longer on this part of the process. I recommend at least 5x-ing this time, so instead of spending 30 minutes manually inspecting, you spend 2.5 hours. Initially, you’ll think this is a lot of time to spend on manual inspection, but you’ll usually find it saves you plenty of time in the long run. Additionally, compared to an entire 3-week project, 2.5 hours is an insignificant amount of time.

Group queries according to taxonomy

Sometimes, manual analysis alone won’t give you all the answers. In those instances, I move on to a more quantitative analysis of my data. This contrasts with the first approach, which I consider qualitative since I’m manually inspecting each data point.

Grouping user queries according to a taxonomy is an efficient approach to better understand what users expect from your LLM. I’ll provide an example to make this easier to understand:

Imagine you’re Amazon, and you have a customer service LLM handling incoming customer questions. In this instance, a taxonomy will look something like:

  • Refund requests
  • Talk to a human requests
  • Questions about individual products
  • …

I would then look at the last 1000 user queries and manually annotate them into this taxonomy. This will tell you which questions are most prevalent, and which ones you should focus most on answering correctly. You’ll often find that the distribution of items in each category will follow a Pareto distribution, with most items belonging to a few specific categories.

Additionally, you should annotate whether each customer request was successfully answered or not. With this information, you can discover which kinds of questions your LLM struggles with and which ones it handles well. Maybe the LLM reliably transfers customers to a human when requested, but struggles when asked for details about a product. In this instance, you should focus your effort on improving the group of questions you’re struggling with the most.
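
Once the queries are annotated, the per-category volume and success rate described above take only a few lines to compute. This is a minimal sketch, assuming each annotation is a `(category, success)` pair; the category names follow the example taxonomy, and the counts are invented for illustration.

```python
from collections import defaultdict


def summarize(annotations: list[tuple[str, bool]]) -> list[dict]:
    """Per-category volume share and success rate, largest category first."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, success in annotations:
        totals[category] += 1
        successes[category] += int(success)
    n = len(annotations)
    rows = [
        {
            "category": c,
            "count": totals[c],
            "share": totals[c] / n,
            "success_rate": successes[c] / totals[c],
        }
        for c in totals
    ]
    return sorted(rows, key=lambda r: r["count"], reverse=True)


# Invented annotations: (category, was the request answered successfully?)
annotations = (
    [("refund", True)] * 5 + [("refund", False)] * 1
    + [("product question", True)] * 1 + [("product question", False)] * 2
    + [("talk to a human", True)] * 1
)

for row in summarize(annotations):
    print(f"{row['category']:<18} n={row['count']:<3} "
          f"share={row['share']:.0%} success={row['success_rate']:.0%}")
```

Sorting by count puts the most common categories first, so a large category with a low success rate stands out immediately.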

LLM as a judge on a golden dataset

Another quantitative approach I use to analyze my LLM outputs is to create a golden dataset of input-output examples and use an LLM as a judge. This gives you a quick regression check whenever you make changes to your LLM.

Continuing the customer support example from earlier, you can create a list of 50 (real) user queries and the desired response to each of them. Whenever you make changes to your LLM (change model version, add more context, …), you can automatically test the new LLM on the golden dataset, and have an LLM as a judge determine if the response from the new model is at least as good as the response from the old model. This will save you vast amounts of time manually inspecting LLM outputs whenever you update your LLM.
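
The regression loop described above can be sketched as a small harness. The judge is passed in as a plain function so that the harness stays independent of any model provider; in this sketch it is a stub that compares strings, whereas a real judge would send the query, the reference, and the new answer to an LLM with a grading prompt. All names and the two golden examples are hypothetical.

```python
from typing import Callable


def run_regression(
    golden: list[dict],
    candidate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> dict:
    """Run the candidate on every golden query and collect the judge's verdicts."""
    failures = []
    for example in golden:
        answer = candidate(example["query"])
        # The judge returns True when the answer is at least as good as the reference.
        if not judge(example["query"], example["reference"], answer):
            failures.append({"query": example["query"], "answer": answer})
    passed = len(golden) - len(failures)
    return {
        "pass_rate": passed / len(golden) if golden else 0.0,
        "failures": failures,
    }


# Hypothetical golden dataset: the reference is the desired (or previous) response.
golden = [
    {"query": "How do I get a refund?", "reference": "Start a return from the Orders page."},
    {"query": "Can I talk to a person?", "reference": "Transferring you to a human agent."},
]


def candidate(query: str) -> str:
    # Stand-in for the updated LLM.
    if "refund" in query:
        return "Start a return from the Orders page."
    return "I cannot help with that."


def stub_judge(query: str, reference: str, answer: str) -> bool:
    # Placeholder only: replace with a call to your judge model.
    return answer.strip().lower() == reference.strip().lower()


report = run_regression(golden, candidate, stub_judge)
print(report["pass_rate"], report["failures"])
```

Keeping the failures alongside the pass rate matters, because the failing queries are exactly the ones worth reading manually.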

If you want to learn more about LLM as a judge, you can read my TDS article on the topic here.

Step 2: Iteratively improving your LLM

You’re done with step one, and you now want to use those insights to improve your LLM. In this section, I discuss how I approach this step to efficiently improve the performance of my LLM.

If I discover significant issues, for example, when manually inspecting data, I always fix those first. Examples include unnecessary noise being added to the LLM’s context, or typos in my prompts. Once those are fixed, I move on to tooling.

One type of tool I use is a prompt optimizer, such as Anthropic’s prompt improver. With these tools, you typically input your prompt and some input-output examples. You can, for example, input the prompt you use for your customer service agents, along with examples of customer interactions where the LLM failed. The prompt optimizer will analyze your prompt and examples and return an improved version of your prompt. You’ll likely see improvements such as:

  • Improved structure in your prompt, for example, using Markdown
  • Handling of edge cases. For example, handling cases where the user queries the customer support agent about completely unrelated topics, such as asking “What is the weather in New York today?”. The prompt optimizer might add something like “If the question is not related to Amazon, tell the user that you’re only designed to answer questions about Amazon”.

If I have more quantitative data, such as from grouping user queries or a golden dataset, I also analyze this data and create a value-effort graph. The value-effort graph highlights the different available improvements you can make, such as:

  • Improved edge case handling in the system prompt
  • Use a better embedding model for improved RAG

You then plot these data points in a 2D grid, with effort on the horizontal axis and value on the vertical axis. You should naturally prioritize items in the upper left quadrant because they provide a lot of value and require little effort. Normally, however, items tend to fall along a diagonal, where higher value correlates strongly with higher required effort.

Figure: a value-effort graph. It displays the different improvements you can make to your product, placed according to how valuable they are and the effort required to build them. Image by ChatGPT.

I put all my improvement suggestions into a value-effort graph, and then gradually pick items that are as high as possible in value, and as low as possible in effort. This is a super effective approach to quickly solve the most pressing issues with your LLM, positively impacting the largest number of customers you can for a given amount of effort.
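
Picking items from the graph can be made explicit with a simple score. The sketch below ranks items by value divided by effort, which is one possible way to express "high value, low effort" and not a rule taken from this article; the 1 to 10 scale and the example scores are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Improvement:
    name: str
    value: float   # estimated benefit, e.g. 1 (low) to 10 (high)
    effort: float  # estimated cost on the same scale, must be greater than 0


def prioritize(items: list[Improvement]) -> list[Improvement]:
    """Order improvements by value per unit of effort, highest first."""
    return sorted(items, key=lambda i: i.value / i.effort, reverse=True)


# Illustrative backlog with made-up scores.
backlog = [
    Improvement("Edge case handling in the system prompt", value=6, effort=2),
    Improvement("Better embedding model for RAG", value=8, effort=7),
    Improvement("Fix duplicated context bug", value=7, effort=1),
]

for item in prioritize(backlog):
    print(f"{item.value / item.effort:4.1f}  {item.name}")
```

The scores are rough estimates, so the ranking is best treated as a starting point for discussion rather than a strict order.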

Step 3: Evaluate and iterate

The last step in my three-step process is to evaluate my LLM and iterate. There are a plethora of techniques you can use to evaluate your LLM, a lot of which I cover in my article on the topic.

Preferably, you create some quantitative metrics for your LLM’s performance, and ensure those metrics have improved from the changes you applied in step 2. After applying these changes and verifying they improved your LLM, you should consider whether the model is good enough or if you should continue improving it. I most often operate on the 80% principle, which states that 80% performance is good enough in almost all cases. This is not a literal 80% accuracy. Rather, it highlights the point that you don’t need a perfect model, only one that is good enough.
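
Verifying that the metrics improved can be as simple as comparing per-category success rates before and after the change. This is a minimal sketch, assuming you already have such rates (for example from the taxonomy annotation or the golden dataset); the `target` threshold is whatever "good enough" means for your product, and the numbers are invented.

```python
def compare_runs(before: dict[str, float], after: dict[str, float], target: float) -> dict:
    """Compare per-category success rates measured before and after a change."""
    shared = sorted(set(before) & set(after))
    return {
        "improved": [c for c in shared if after[c] > before[c]],
        "regressed": [c for c in shared if after[c] < before[c]],
        "below_target": [c for c in shared if after[c] < target],
    }


# Invented success rates per category.
before = {"refund": 0.83, "product question": 0.33, "talk to a human": 1.0}
after = {"refund": 0.80, "product question": 0.70, "talk to a human": 1.0}

print(compare_runs(before, after, target=0.75))
```

A category that regressed is worth a manual look even when the overall numbers went up, since an aggregate metric can hide it.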

Conclusion

In this article, I have discussed the scenario where you already have an LLM in production, and you want to analyze and improve your LLM. I approach this scenario by first analyzing the model inputs and outputs, preferably by full manual inspection. After ensuring I really understand the dataset and how the model behaves, I also move into more quantitative metrics, such as grouping queries into a taxonomy and using LLM as a judge. Following this, I implement improvements based on my findings in the previous step, and lastly, I evaluate whether my improvements worked as intended.
