Measuring What Matters with NeMo Agent Toolkit

After a decade working in analytics, I firmly believe that observability and evaluation are essential for any LLM application running in production. Monitoring and metrics aren’t just nice-to-haves. They ensure your product is functioning as expected and that each new update is actually moving you in the right direction.

In this article, I want to share my experience with the observability and evaluation features of the NeMo Agent Toolkit (NAT). If you haven’t read my previous article on NAT, here’s a quick refresher: NAT is NVIDIA’s framework for building production-ready LLM applications. Think of it as the glue that connects LLMs, tools, and workflows, while also offering deployment and observability options.

Using NAT, we built a Happiness Agent capable of answering nuanced questions about the World Happiness Report data and performing calculations based on real metrics. Our focus was on building agentic flows, integrating agents from other frameworks as tools (in our example, a LangGraph-based calculator agent), and deploying the application both as a REST API and with a user-friendly interface.

In this article, I’ll dive into my favourite topics: observability and evaluations. After all, as the saying goes, you can’t improve what you don’t measure. So, without further ado, let’s jump in.

Observability

Let’s start with observability — the ability to track what’s happening inside your application, including all intermediate steps, tools used, timings, and token usage. The NeMo Agent Toolkit integrates with a variety of observability tools such as Phoenix, W&B Weave, and Catalyst. You can always check the latest list of supported frameworks in the documentation.

For this article, we’ll try Phoenix. Phoenix is an open-source platform for tracing and evaluating LLMs. Before we can start using it, we first need to install Phoenix itself and the NAT plugin for it.

uv pip install arize-phoenix
uv pip install "nvidia-nat[phoenix]"

Next, we can launch the Phoenix server.

phoenix serve

Once it’s running, the Phoenix UI will be available at http://localhost:6006, and the tracing endpoint at http://localhost:6006/v1/traces. At this point, you’ll see only a default project in the UI since we haven’t sent any data yet.

Now that the Phoenix server is running, let’s see how we can start using it. Since NAT is based on YAML configuration, all we need to do is add a telemetry section to our config. You can find the config and full agent implementation on GitHub. If you want to learn more about the NAT framework, check my previous article.

general:                                             
  telemetry:                                          
    tracing:                                          
      phoenix:                                        
        _type: phoenix                               
        endpoint: http://localhost:6006/v1/traces 
        project: happiness_report

With this in place, we can run our agent.

export ANTHROPIC_API_KEY=
source .venv_nat_uv/bin/activate
cd happiness_v3 
uv pip install -e . 
cd .. 
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "How much happier in percentages are people in Finland compared to the United Kingdom?"

Let’s run a few more queries to see what kind of data Phoenix can track.

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people overall getting happier over time?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Is Switzerland on the first place?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "What is the main contributor to the happiness in the United Kingdom?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people in France happier than in Germany?"
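
Typing each command by hand gets tedious once you have more than a handful of questions. As a minimal sketch, the same queries could be driven from Python with subprocess. It only uses the `nat run` flags shown above; the helper names `build_nat_run_command` and `run_queries` are my own illustration and not part of NAT.

```python
import subprocess

CONFIG_FILE = "happiness_v3/src/happiness_v3/configs/config.yml"

QUESTIONS = [
    "Are people overall getting happier over time?",
    "Is Switzerland on the first place?",
    "What is the main contributor to the happiness in the United Kingdom?",
    "Are people in France happier than in Germany?",
]


def build_nat_run_command(config_file: str, question: str) -> list[str]:
    """Build the argument list for a single `nat run` call."""
    return ["nat", "run", "--config_file", config_file, "--input", question]


def run_queries(config_file: str, questions: list[str]) -> None:
    """Run each question through the agent, stopping at the first failure."""
    for question in questions:
        subprocess.run(build_nat_run_command(config_file, question), check=True)
```

Calling `run_queries(CONFIG_FILE, QUESTIONS)` from the same directory and virtual environment as the commands above should produce one trace per question.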

After running these queries, you’ll notice a new project in Phoenix (happiness_report, as we defined in the config) along with all the LLM calls we just made. This gives you a clear view of what’s happening under the hood.

We can zoom in on one of the queries, like “Are people overall getting happier over time?”

This query takes quite a while (about 25 seconds) because it involves five tool calls, one for each year. If we expect a lot of similar questions about overall trends, it might make sense to give our agent a new tool that can calculate summary statistics all at once.

This is exactly where observability shines: by revealing bottlenecks and inefficiencies, it helps you reduce costs and deliver a smoother experience for users.

Evaluations

Observability is about tracing how your application works in production. This information is helpful, but it is not enough to say whether the quality of answers is good enough or whether a new version is performing better. To answer such questions, we need evaluations. Fortunately, the NeMo Agent Toolkit can help us with evals as well.

First, let’s put together a small evaluation dataset. Each sample needs just three fields: id, question and answer.

[
  {
    "id": "1",
    "question": "In what country was the happiness score highest in 2021?",
    "answer": "Finland"
  }, 
  {
    "id": "2",
    "question": "What contributed most to the happiness score in 2024?",
    "answer": "Social Support"
  }, 
  {
    "id": "3",
    "question": "How UK's rank changed from 2019 to 2024?",
    "answer": "The UK's rank dropped from 13th in 2019 to 23rd in 2024."
  },
  {
    "id": "4",
    "question": "Are people in France happier than in Germany based on the latest report?",
    "answer": "No, Germany is at 22nd place in 2024 while France is at 33rd place."
  },
  {
    "id": "5",
    "question": "How much in percents are people in Poland happier in 2024 compared to 2019?",
    "answer": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024."
  }
]
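
Since a malformed sample only shows up once the evaluation is already running, it can be worth checking the dataset up front. The sketch below is my own helper, not a NAT feature: it assumes the dataset is a JSON array like the one above and checks that every sample has the three fields and a unique id.

```python
import json

REQUIRED_FIELDS = ("id", "question", "answer")


def validate_eval_dataset(raw_json: str) -> list[dict]:
    """Parse the dataset and check every sample has the three required fields."""
    samples = json.loads(raw_json)
    if not isinstance(samples, list):
        raise ValueError("expected a JSON array of samples")
    seen_ids = set()
    for position, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample at index {position} is not an object")
        # Treat empty strings the same as missing fields.
        missing = [field for field in REQUIRED_FIELDS if not sample.get(field)]
        if missing:
            raise ValueError(f"sample at index {position} is missing {missing}")
        if sample["id"] in seen_ids:
            raise ValueError(f"duplicate id {sample['id']!r}")
        seen_ids.add(sample["id"])
    return samples
```

In practice you would pass it the contents of the dataset file, for example `validate_eval_dataset(open("src/happiness_v3/data/evals.json").read())`.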

Next, we need to update our YAML config to define where to store evaluation results and where to find the evaluation dataset. I set up a dedicated eval_llm for evaluation purposes to keep the solution modular, and I’m using Sonnet 4.5 for it.

# Evaluation configuration
eval:
  general:
    output:
      dir: ./tmp/nat/happiness_v3/eval/evals/
      cleanup: false  
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm

I’ve defined several evaluators here. We’ll focus on Answer Accuracy and Response Groundedness from Ragas (an open-source framework for evaluating LLM workflows end-to-end), as well as trajectory evaluation. Let’s break them down.

Answer Accuracy measures how well a model’s response aligns with a reference ground truth. It uses two “LLM-as-a-Judge” prompts, each returning a rating of 0, 2, or 4. These ratings are then converted to a [0,1] scale and averaged. Higher scores indicate that the model’s answer closely matches the reference.

0 → Response is inaccurate or off-topic,
2 → Response partially aligns,
4 → Response exactly aligns.

Response Groundedness evaluates whether a response is supported by the retrieved contexts. That is, whether each claim can be found (fully or partially) in the provided data. This works similarly to Answer Accuracy, using two distinct “LLM-as-a-Judge” prompts with ratings of 0, 1, or 2, which are then normalised to a [0,1] scale.

0 → Not grounded at all,
1 → Partially grounded,
2 → Fully grounded.
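
Both Ragas metrics follow the same pattern: two judge ratings on a small integer scale, mapped to [0,1] and averaged. Here is a minimal sketch of that arithmetic, based only on the description above. The function name and the validation are my own, and the real Ragas implementation may treat edge cases (such as a failed judge call) differently.

```python
def normalise_judge_ratings(ratings: list[int], allowed: tuple[int, ...]) -> float:
    """Map each judge rating onto [0, 1] and average the results."""
    if not ratings:
        raise ValueError("at least one rating is required")
    top = max(allowed)
    for rating in ratings:
        if rating not in allowed:
            raise ValueError(f"rating {rating} is not one of {allowed}")
    return sum(rating / top for rating in ratings) / len(ratings)


ANSWER_ACCURACY_SCALE = (0, 2, 4)
GROUNDEDNESS_SCALE = (0, 1, 2)

# One judge says "exactly aligns" (4), the other "partially aligns" (2).
print(normalise_judge_ratings([4, 2], ANSWER_ACCURACY_SCALE))  # 0.75
# Both judges say "fully grounded" (2).
print(normalise_judge_ratings([2, 2], GROUNDEDNESS_SCALE))  # 1.0
```

So a score of 1.0 means both judges gave the top rating, while anything in between signals that at least one of them saw only a partial match.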

Trajectory Evaluation tracks the intermediate steps and tool calls executed by the LLM, helping to monitor the reasoning process. A judge LLM evaluates the trajectory produced by the workflow, considering the tools used during execution. It returns a floating-point score between 0 and 1, where 1 represents a perfect trajectory.

Let’s run evaluations to see how it works in practice.

nat eval --config_file src/happiness_v3/configs/config.yml

Running the evaluations produces several files in the output directory we specified earlier. One of the most useful is workflow_output.json. It contains execution results for each sample in our evaluation set, including the original question, the answer generated by the LLM, the expected answer, and a detailed breakdown of all intermediate steps, so you can trace how the system worked in each case.

Here’s a shortened example for the first sample.

{
  "id": 1,
  "question": "In what country was the happiness score highest in 2021?",
  "answer": "Finland",
  "generated_answer": "Finland had the highest happiness score in 2021 with a score of 7.821.",
  "intermediate_steps": [...],
  "expected_intermediate_steps": []
}
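
With more than a few samples, reading this file by eye stops being practical. As a rough sketch, the helper below condenses it into one row per sample. It assumes the file is a JSON array of objects shaped like the shortened example above; the function name is my own, and since the contents of each intermediate step are elided here, it only counts them.

```python
import json


def summarise_workflow_output(raw_json: str) -> list[dict]:
    """Return one compact row per evaluated sample."""
    rows = []
    for item in json.loads(raw_json):
        rows.append(
            {
                "id": item["id"],
                "question": item["question"],
                "expected": item["answer"],
                "generated": item["generated_answer"],
                # Step contents are not inspected, only counted.
                "n_steps": len(item.get("intermediate_steps", [])),
            }
        )
    return rows
```

Sorting the rows by `n_steps` is a quick way to spot the questions where the agent takes the longest path.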

For Answer Accuracy and Response Groundedness, we achieved the highest possible scores (1.0 out of 1.0 on average), which is always nice to see. Here is the returned file.

{
  "average_score": 1.0,
  "eval_output_items": [
    {
      "id": 1,
      "score": 1.0,
      "reasoning": {
        "user_input": "In what country was the happiness score highest in 2021?",
        "reference": "Finland",
        "response": "Finland had the highest happiness score in 2021 with a score of 7.821.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 2,
      "score": 1.0,
      "reasoning": {
        "user_input": "What contributed most to the happiness score in 2024?",
        "reference": "Social Support",
        "response": "In 2024, **Social Support** contributed most to the happiness score, with an average impact of 1.333 points. This was followed very closely by **GDP per capita** with an impact of 1.319 points. These two factors were significantly more influential than other contributors like life expectancy (0.551), freedom (0.749), generosity (0.113), and perceptions of corruption (0.147). This suggests that having strong social connections and relationships, along with economic prosperity, were the primary drivers of happiness across countries in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 3,
      "score": 1.0,
      "reasoning": {
        "user_input": "How UK's rank changed from 2019 to 2024?",
        "reference": "The UK's rank dropped from 13th in 2019 to 23rd in 2024.",
        "response": "The UK's rank in the World Happiness Report changed from 13th place in 2019 to 23rd place in 2024, representing a decline of 10 positions. The happiness score also decreased from 7.1645 in 2019 to 6.7280 in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 4,
      "score": 1.0,
      "reasoning": {
        "user_input": "Are people in France happier than in Germany based on the latest report?",
        "reference": "No, Germany is at 22nd place in 2024 while France is at 33rd place.",
        "response": "Based on the latest 2024 World Happiness Report, people in Germany are happier than people in France. Germany has a happiness score of 6.753 (ranked 22nd globally), while France has a happiness score of 6.593 (ranked 33rd globally). The difference is 0.16 points in Germany's favor.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 5,
      "score": 1.0,
      "reasoning": {
        "user_input": "How much in percents are people in Poland happier in 2024 compared to 2019?",
        "reference": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024.",
        "response": "People in Poland are approximately 7.87% happier in 2024 compared to 2019. The happiness score increased from 6.1863 in 2019 to 6.6730 in 2024, representing an increase of 0.4867 points or about 7.87%.",
        "retrieved_contexts": [...]
      }
    }
  ]
}
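
When scores are not perfect, the first thing I want to know is which samples pulled the average down. Below is a minimal sketch that works on an evaluator output parsed into a dict, assuming the structure shown above (`average_score` plus a list of `eval_output_items`). The helper name is hypothetical.

```python
def low_scoring_items(report: dict, threshold: float) -> list[dict]:
    """Return id, score and question for every sample scoring below `threshold`."""
    flagged = []
    for item in report["eval_output_items"]:
        if item["score"] < threshold:
            flagged.append(
                {
                    "id": item["id"],
                    "score": item["score"],
                    "question": item["reasoning"]["user_input"],
                }
            )
    # Worst samples first, so they are the first ones you read.
    return sorted(flagged, key=lambda row: row["score"])
```

For the file above, `low_scoring_items(report, threshold=1.0)` returns an empty list, since every sample scored 1.0.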

For trajectory evaluation, we achieved an average score of 0.95. To understand where the model fell short, let’s look at one non-ideal example. For the fifth question, the judge correctly identified that the agent followed a suboptimal path: it took 8 steps to reach the final answer, even though the same result could have been achieved in 4–5 steps. As a result, this trajectory received a score of 0.75 out of 1.0.

Let me evaluate this AI language model's performance step by step:

## Evaluation Criteria:
**i. Is the final answer helpful?**
Yes, the final answer is clear, accurate, and directly addresses the question. 
It provides both the percentage increase (7.87%) and explains the underlying 
data (happiness scores from 6.1863 to 6.6730). The answer is well-formatted 
and easy to understand.

**ii. Does the AI language use a logical sequence of tools to answer the question?**
Yes, the sequence is logical:
1. Query country statistics for Poland
2. Retrieve the data showing happiness scores for multiple years including 
2019 and 2024
3. Use a calculator to compute the percentage increase
4. Formulate the final answer
This is a sensible approach to the problem.

**iii. Does the AI language model use the tools in a helpful way?**
Yes, the tools are used appropriately:
- The `country_stats` tool successfully retrieved the relevant happiness data
- The `calculator_agent` correctly computed the percentage increase using 
the proper formula
- The Python evaluation tool performed the actual calculation accurately

**iv. Does the AI language model use too many steps to answer the question?**
This is where there's some inefficiency. The model uses 8 steps total, which 
includes some redundancy:
- Steps 4-7 appear to involve multiple calls to calculate the same percentage 
(the calculator_agent is invoked, which then calls Claude Opus, which calls 
evaluate_python, and returns through the chain)
- Step 7 seems to repeat what was already done in steps 4-6
While the answer is correct, there's unnecessary duplication. The calculation 
could have been done more efficiently in 4-5 steps instead of 8.

**v. Are the appropriate tools used to answer the question?**
Yes, the tools chosen are appropriate:
- `country_stats` was the right tool to get happiness data for Poland
- `calculator_agent` was appropriate for computing the percentage change
- The underlying `evaluate_python` tool correctly performed the mathematical 
calculation

## Summary:
The model successfully answered the question with accurate data and correct 
calculations. The logical flow was sound, and appropriate tools were selected. 
However, there was some inefficiency in the execution with redundant steps 
in the calculation phase.

Looking at the reasoning, this turns out to be a surprisingly comprehensive evaluation of the entire LLM workflow. What’s especially valuable is that it works out of the box and doesn’t require any ground-truth data. I would definitely advise using this evaluation for your applications.

Comparing different versions

Evaluations become especially powerful when you need to compare different versions of your application. Imagine a team focused on cost optimisation and considering a switch from the more expensive Sonnet model to Haiku. With NAT, changing the model takes less than a minute, but doing so without validating quality would be risky. This is exactly where evaluations shine.

For this comparison, we’ll also introduce another observability tool: W&B Weave. It provides particularly handy visualisations and side-by-side comparisons across different versions of your workflow.

To get started, you’ll need to sign up on the W&B website and obtain an API key. W&B is free to use for personal projects.

export WANDB_API_KEY=

Next, install the required packages and plugins.

uv pip install wandb weave
uv pip install "nvidia-nat[weave]"

We also need to update our YAML config. This includes adding Weave to the telemetry section and introducing a workflow alias so we can clearly distinguish between different versions of the application.

general:                                             
  telemetry:                                          
    tracing:                                          
      phoenix:                                        
        _type: phoenix                               
        endpoint: http://localhost:6006/v1/traces 
        project: happiness_report
      weave: # specified Weave
        _type: weave
        project: "nat-simple"

eval:
  general:
    workflow_alias: "nat-simple-sonnet-4-5" # added alias
    output:
      dir: ./.tmp/nat/happiness_v3/eval/evals/
      cleanup: false  
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: chat_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: chat_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: chat_llm

For the Haiku version, I created a separate config where both chat_llm and calculator_llm use Haiku instead of Sonnet.

Now we can run evaluations for both versions.

nat eval --config_file src/happiness_v3/configs/config.yml
nat eval --config_file src/happiness_v3/configs/config_simple.yml

Once the evaluations are complete, we can head over to the W&B interface and explore a comprehensive comparison report. I really like the radar chart visualisation, since it makes trade-offs immediately obvious.

With Sonnet, we observe higher token usage (and higher cost per token) as well as slower response times (24.8 seconds compared to 16.9 seconds for Haiku). However, despite the clear gains in speed and cost, I wouldn’t recommend switching models. The drop in quality is too large: trajectory accuracy falls from 0.85 to 0.55, and answer accuracy drops from 0.95 to 0.45. In this case, evaluations helped us avoid breaking the user experience in the pursuit of cost optimisation.
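
To put the numbers above on a common scale, here is a small sketch that expresses each one as a relative change and flags quality regressions. The metric values are the ones reported above; the 10% tolerance and the helper names are assumptions of mine, so pick a threshold that matches your own quality bar.

```python
SONNET = {"latency_s": 24.8, "trajectory_accuracy": 0.85, "answer_accuracy": 0.95}
HAIKU = {"latency_s": 16.9, "trajectory_accuracy": 0.55, "answer_accuracy": 0.45}

QUALITY_METRICS = ("trajectory_accuracy", "answer_accuracy")
MAX_QUALITY_DROP = 0.10  # hypothetical tolerance: at most a 10% relative drop


def relative_change(baseline: float, candidate: float) -> float:
    """Relative change of the candidate with respect to the baseline."""
    return (candidate - baseline) / baseline


def quality_regressions(
    baseline: dict, candidate: dict, max_drop: float = MAX_QUALITY_DROP
) -> dict[str, float]:
    """Return the quality metrics whose relative drop exceeds `max_drop`."""
    regressions = {}
    for metric in QUALITY_METRICS:
        change = relative_change(baseline[metric], candidate[metric])
        if change < -max_drop:
            regressions[metric] = change
    return regressions


for metric in SONNET:
    print(f"{metric}: {relative_change(SONNET[metric], HAIKU[metric]):+.1%}")
# latency_s: -31.9%
# trajectory_accuracy: -35.3%
# answer_accuracy: -52.6%
```

A roughly 32% latency gain bought with a 53% drop in answer accuracy is a poor trade, and `quality_regressions(SONNET, HAIKU)` flags both quality metrics accordingly.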

You can find the full implementation on GitHub.

Summary

In this article, we explored the NeMo Agent Toolkit’s observability and evaluation capabilities.

We worked with two observability tools (Phoenix and W&B Weave), both of which integrate seamlessly with NAT and allow us to log what’s happening inside our system in production, as well as capture evaluation results.
We also walked through how to configure evaluations in NAT and used W&B Weave to compare the performance of two different versions of the same application. This made it easy to reason about trade-offs between cost, latency, and answer quality.

The NeMo Agent Toolkit delivers solid, production-ready solutions for observability and evaluations — foundational pieces of any serious LLM application. However, the standout for me was W&B Weave, whose evaluation visualisations make comparing models and trade-offs remarkably straightforward.

Thank you for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.