
How to Do Evals on a Bloated RAG Pipeline


This article is a follow-up to Building an Overengineered Retrieval System. That one was about building the entire system; this one is about doing the evals for it.

In the previous article, I went through different parts of a RAG pipeline: chunking the data properly, query optimization, retrieval (semantic, BM25, or hybrid search), re-ranking, expanding chunks to neighbors, building the context, and then generation with an LLM.

One of the questions I got was: does expanding chunks to neighbors actually improve answers, or does it just add noise and make it harder for the model to stay grounded?

So that’s what we’ll test here. We’ll run some basic evaluations and look at metrics like faithfulness, answer relevancy, context relevance, and hallucination rate, and compare results across different models and datasets. 

I’ve collected most of the results here and here already, but we’ll go through them too.

As a note, I’m planning to compare this kind of “advanced” pipeline to a more naive baseline later. But this article is mainly about evaluating the pipeline as it is.


I always go through some intro sections before I dig in, but if you’re new-new, I’d first read up on how to build a basic RAG system, how embeddings work, and an actual intro to evals/metrics. Then you can also read how to build the over-engineered pipeline I introduced above, or at least skim it. 

If none of this is new, then skip to the results part.

Why we perform evals

Evals are about pressure-testing the system on a bigger (and more targeted) corpus than your favorite 10 questions, and making sure that whatever changes you push don’t degrade the quality of the system.

Changes in data, prompts, or models can very much affect performance without you seeing it.

You may also need to show your team the general performance of the system you’ve built before being allowed to test it on real users.

But before you do this, you need to decide what to test.

What does a successful system look like to you? If you care about multi-hop, you need questions that actually require multi-hop. If you care about Q&A and proper citations, you test for that. Otherwise, you end up evaluating the wrong thing.

This is a bit like doing investigative work: you test something, you try to understand the results, and then you build better tests.

To do this well, you should try to build a golden set (often from user logs) to test with.

This isn’t always possible, so in situations like this we build synthetic datasets. This may not be the best way to do it, as it will clearly be biased and won’t reflect what your users will actually ask. Nevertheless, you may need to start somewhere.

For this article, I’ve created three different datasets so we can discuss it: one created from the ingested corpus, one that creates messy user questions from the corpus, and one with random questions on RAG that haven’t been generated from the corpus at all.

You’ll be able to see how these datasets give us different results on the metrics, and how those results mean different things.

What to think about

I’m not going to go through everything there is to think about here, because doing evals well is pretty difficult (although also fun if you like statistics and data).

But there are a few things you need to keep in mind: LLM judges are biased, cherry-picking questions is a problem, gold answers are best if you have them, and using a larger dataset with tags helps you break down where and how the system is failing.

If you’ve read the eval metrics article, you’ve already seen the idea of LLM-as-a-judge. It can be useful, but it’s not inherently reliable because it has baked-in preferences and blind spots.

There are things that will make you go mad, like a judge punishing an answer that’s based on the corpus but not explicitly stated in the retrieved chunks (summaries / small inferences), or judging the same answer differently depending on how the question is phrased.

You’ll notice this later when you dig into the failing questions to understand why.

Another thing to keep in mind is to make sure not to “cherry-pick” questions, even if you feel the urge to.

You obviously have to start somewhere, but the goal is to get close to what your users are actually asking, find the issues, and update the dataset continuously based on where the system seems to fail. It’s easy to get nice numbers if you mostly test “easy” questions, but then the eval becomes less useful.

The best thing is to have not just real user questions but also gold answers.

So even if you can “bypass” having references by using an LLM judge, having the correct answers for those questions is best. That’s when you can use the LLM to judge whether the output matches the gold answer, instead of asking it to judge the answer on its own.

Sample size matters too. Too small and it may not be reliable. Too big and it’s easy to miss smaller problems.

If you have enough data, you can tag questions into topics, different wordings (pessimistic / typical phrasing), and different types (short / long / messy) so you can see what breaks where.
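If you do have tags, the breakdown itself is only a few lines. A minimal sketch in Python, assuming you’ve stored per-question results somewhere; the column names here are made up for illustration:

import pandas as pd

# Hypothetical per-question eval results; the tag and metric columns are illustrative.
rows = [
    {"question_id": "q1", "topic": "retrieval", "phrasing": "messy", "faithfulness": 0.91},
    {"question_id": "q2", "topic": "retrieval", "phrasing": "clean", "faithfulness": 0.97},
    {"question_id": "q3", "topic": "chunking", "phrasing": "messy", "faithfulness": 0.62},
]
df = pd.DataFrame(rows)

# Break a metric down per tag to see what kind of question breaks where.
print(df.groupby(["topic", "phrasing"])["faithfulness"].agg(["mean", "count"]))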

I’ve heard recommendations that start with something like 200–1,000 real queries with gold answers if you want this to be a real evaluation setup.

Since this entire exercise is hypothetical, and the system has ingested documents mainly to demo the idea of expanding to neighbors, the eval datasets have been synthetically generated and are thus less reliable, but there are still learnings we can get from them.

Deciding on metrics & datasets

This section is about two things: which metrics I’m using to evaluate the pipeline, and how I’m using them across datasets to see if neighbor expansion seems to help.

First, if you haven’t read about evals for LLM systems at all, go read this article. It gives you a taxonomy of the different metrics out there (RAG included).

Since I’m being lazy for this one, I needed reference-free metrics, but this also limits what we can actually test: the judge can only look at the context, the question, and the generated answer.

A few metrics that can help here are faithfulness (is the answer grounded in the provided context), answer relevancy (does it actually answer the question), context relevancy (how much of the context is just noise), and hallucination (how many claims are actually backed up by the provided context).

Since we want to figure out if neighbor expansion is useful, and without building two different pipelines, we can do one simple comparison: for the faithfulness metric, ask the judge to look at the seed chunks vs. the final expanded context and score how much of the answer is grounded in each.

If grounding improves when the judge sees the expanded context, that’s at least evidence that the model is using the expanded chunks and it’s not just noise. We would need more testing, though, to say for sure which is the winner.

Finally, the datasets matter as much as the metrics.

If you’ve read the first article, you know that all the docs that have been ingested are scientific articles that mention RAG. So all the questions that we create here need to be about RAG.

I have generated three different datasets, each with a different RAG flavor.

The first is based on the ingested corpus: I go through each scientific article and write two questions that it can answer.

The second is doing the same but providing messy questions like, “how does k2 btw rag improve answer fetching compared to naive rag, like what’s the similarity scores in terms of q3?”

This messy-user-questions dataset would be good for testing the query optimizer from the first article (though I don’t have those results for you here). Here it mainly tells us whether phrasing things differently skews the results.

The third dataset is based on 66 random RAG questions found online. This means that these questions may not have answers in the corpus (the ingested RAG articles are just from September to October, so we don’t know exactly what they contain).

So the first two will evaluate how well the pipeline behaves, whether it can answer questions on the documents it has, and the third one tells us what it is missing and how it behaves on questions that it might not be able to answer.

Though this is a bit simplified, as the corpus-based questions tend to follow the section structure, while the random ones may be answered well enough by the seed chunks alone.
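For reference, generating the corpus-grounded questions for the first two datasets can be as simple as prompting a model once per article or section. A rough sketch with the OpenAI client; the prompt, model name, and JSON shape are placeholders rather than exactly what I used:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "You will get an excerpt from a scientific article about RAG. "
    "Write two questions that can be answered from the excerpt alone. "
    'Return JSON like {"questions": ["...", "..."]}.'
)

def questions_for(doc_text: str, model: str = "gpt-4o-mini") -> list[str]:
    # The model here is just a placeholder; use whatever you generate datasets with.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": doc_text[:8000]},  # crude length cap
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["questions"]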

Running the evals

To run the evals, you first need to run the pipeline on every question, for every model, and store the results.

If you don’t store everything you need, you can’t debug later. You want to be able to go from a low score back to the exact answer, the exact retrieved context, and the exact model settings.

I also wanted to compare models, because people assume “bigger model = better answers,” and that’s not always true, especially for easier tasks. So I’m running the same pipeline across GPT-5-mini, GPT-5.1, and GPT-5.2, for several datasets.
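A minimal sketch of that run-and-store step, assuming a run_pipeline() function like the one from the previous article; the function name, model IDs, and record fields are illustrative:

import json
import time
from pathlib import Path

MODELS = ["gpt-5-mini", "gpt-5.1", "gpt-5.2"]  # placeholder model identifiers

def run_all(questions: list[dict], run_pipeline, out_path: str = "runs.jsonl") -> None:
    # run_pipeline(question, model) is assumed to return the answer, the seed
    # chunks, the full expanded context, and the settings that were used.
    with Path(out_path).open("a", encoding="utf-8") as f:
        for model in MODELS:
            for q in questions:
                result = run_pipeline(q["question"], model=model)
                record = {
                    "timestamp": time.time(),
                    "dataset": q.get("dataset"),
                    "question": q["question"],
                    "model": model,
                    "answer": result["answer"],
                    "seed_chunks": result["seed_chunks"],
                    "full_context": result["full_context"],
                    "settings": result.get("settings", {}),
                }
                # One JSON object per line, so any low score can be traced back
                # to the exact answer, context, and settings later.
                f.write(json.dumps(record, ensure_ascii=False) + "\n")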

Once that’s done, I build the eval layer on top of those stored outputs.

I used RAGAS for the standard metrics and DeepEval for the custom ones. You can obviously build it manually, but it’s much easier this way. I love how seamless DeepEval is, though it’s harder to debug if you find issues with the judge later. 

A few specifics: the pipeline runs with no context cap, the judge model is gpt-4o-mini, and we use n=3 for RAGAS and n=1 for the custom judges.
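For the RAGAS side, the core call looks roughly like the sketch below. The exact imports and column names depend on the RAGAS version you have installed, so treat it as a sketch rather than drop-in code:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Older RAGAS versions expect a question, an answer, and a list of context
# strings per row; newer releases rename some of these, so check the docs.
ds = Dataset.from_dict({
    "question": ["What is hybrid search?"],
    "answer": ["Hybrid search combines BM25 with semantic retrieval ..."],
    "contexts": [[
        "BM25 is a lexical ranking function ...",
        "Dense retrieval embeds queries and chunks ...",
    ]],
})

scores = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(scores)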

Since neighbor expansion is the whole point of this pipeline, remember we also run this check: for faithfulness, we score grounding against the seed chunks and against the full expanded context, to see if there’s a difference.
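That check is really just the same judge run twice over two different contexts. A rough sketch with gpt-4o-mini as the judge; the prompt and the score parsing are illustrative, not the exact custom judge I used:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Given a context and an answer, rate from 0 to 1 how much of the answer "
    "is directly supported by the context. Reply with only the number."
)

def grounding_score(answer: str, context: str, model: str = "gpt-4o-mini") -> float:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
    )
    return float(resp.choices[0].message.content.strip())

def seed_vs_full(answer: str, seed_chunks: list[str], full_context: list[str]) -> dict:
    # Score the same answer against the seed chunks and the expanded context.
    return {
        "seed_faithfulness": grounding_score(answer, "\n\n".join(seed_chunks)),
        "full_faithfulness": grounding_score(answer, "\n\n".join(full_context)),
    }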

Eval results of datasets & models

Let’s run the evals for the different datasets, metrics, and models to see how the pipeline is doing and how we can interpret the results. Remember you can find the full results here and here (especially if you dislike my childish sketches).

We can start with the results from the dataset generated by the corpus.

Remember that you’ll find the real table here.

The table above shows the first RAGAS metrics. Faithfulness (does it stay grounded in the context provided) and answer relevancy (does it answer the question) are very high.

This is to be expected, as we’re basically giving it questions that it should be able to answer with the documents. If these showed low numbers, there would be something severely off in the pipeline.

We also get back seed faithfulness, where the judge estimates how grounded the answer is in the seed chunks alone. This is overall a lot lower than the full-context faithfulness, by 12–18 points across the different models.

In fewer words: we can say that the LLM is using some of the full context, not just the seed chunks, when generating its answer.

What we can’t judge though is if the seed-only answer would have been just as good. This will require us to run two pipelines and compare the same metrics and datasets for each.

Now let’s look at these next metrics (for the same dataset).

Remember that you’ll find the real table here.

I would have expected context relevance to drop here, since the judge is looking at the full context, which pulls in up to 10 neighboring chunks per section.

A reason it holds up may be that the generated questions are section-based, which means the added context actually helps answer them.

Structure citations (i.e. does it cite its claims correctly) looks alright, and hallucination is also high, which is good here (1 means no made-up claims in the answer).

Now you’ll see that the different models show very little difference in performance.

Yes, this is quite an easy Q&A task. But it does demonstrate that the additional size of the model may not be needed for everything, and the added context expansion may be able to act as a buffer for the smaller models.

Now let’s look at the results if we change the dataset to those messy user questions instead.

Remember that you’ll find the real table here.

We see scores drop by a few points, but they still stay high, though without isolating the outliers here we can’t say why. Faithfulness does look lower when judging only against the seed chunks for the messy user questions, which is interesting.

Let’s now turn to the third dataset, which will be able to tell us a lot more.

Remember that you’ll find the real table here.

We see worse numbers all around, which is of course expected: the corpus that has been ingested probably can’t answer all of these questions that well. This helps us point to where we have missing information.

Faithfulness still stays high for the full-context runs, though. Here the difference from the seed-only runs is a lot larger, which means the added expansion is being used more in the answers.

Something strange here was that GPT-5.2 consistently did worse on answer relevancy across two different runs. This can be a metric thing, or it can be a model thing where it answers more cautiously than before, thus getting a lower score.

This also tells you why it’s so important to test these new models on your own pipelines before adding them in.

Let’s continue with the other metrics for the random dataset.

Remember that you’ll find the real table here.

Context relevance is very low, so the judge thinks there is a lot of noise in the context that does not directly answer the question for this dataset.

The context relevance metric rewards retrieving a high proportion of directly relevant chunks, but seed+expand intentionally adds neighbor chunks (definitions, adjacent sections, captions) that should improve grounding. Unfortunately, there may be some noise that comes with that.

Both structure citations and hallucination score worse here, probably because it’s harder to stay grounded if you’re not provided any relevant information to use.

Now this was a first build for this pipeline. Based on these results, you can improve it until you drive those numbers up. Maybe change the metrics to be more custom to your pipeline.

The datasets show different things: the corpus-based ones show that the pipeline works, while the random one shows where information is missing and how much more noise there is to navigate around.

What to examine

So what do you do now that you have all these scores? You can look at regressions, worst offenders, and do some random sampling.

Like I said before, this whole exercise is a bit like investigative work, which you’ll love if you love working with data. If you do have enough data and you’ve tagged it, then by all means break it down and look at the individual numbers and see what you find. But there are three checks you should always do.

First is regressions: check the stuff that used to pass and now fails following changes. 

Second is checking the worst offenders (the bottom ~5–10 data points) to see what is happening. It could be the metric, the judge, the retrieval, or the model.
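Both checks are easy to script over the stored outputs. A small sketch under assumed file and field names:

import json
from pathlib import Path

def load(path: str) -> list[dict]:
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

rows = load("eval_results.jsonl")        # hypothetical per-question scores
prev = {r["question"]: r for r in load("eval_results_prev.jsonl")}

# Worst offenders: the bottom ~10 by a chosen metric.
for r in sorted(rows, key=lambda r: r["full_faithfulness"])[:10]:
    print(round(r["full_faithfulness"], 2), r["question"][:80])

# Regressions: questions that passed on the previous run but fail now.
THRESHOLD = 0.7  # arbitrary pass bar, just for illustration
for r in rows:
    old = prev.get(r["question"])
    if old and old["full_faithfulness"] >= THRESHOLD > r["full_faithfulness"]:
        print("REGRESSION:", r["question"][:80])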

Below is one of the worst ones for full faithfulness and answer relevancy.

"question": "how often does entity linking mess up acronyms and lookalikes, and how do i test that",
"answer": "The documents do not report a direct number for "how often" entity linking fails specifically on acronyms or look‑alike names.\n\nWhat we do see is that:\n- Even when structure is correct, LLMs often link to the wrong property or entity (semantic inaccuracies) in SPARQL generation over a scholarly KG [2]. This is the same failure mode you'd see when an acronym (e.g., NMF, ACE) is mapped to the wrong node.\n- In medical KGs, entity linking relies on exact keywords and then semantic similarity of embeddings when no exact match exists [3]. This last step explicitly allows linking to a merely "semantically related" concept rather than the exact one, which is where lookalikes and acronym collisions can occur [3].\n- In conversational entity‑centric systems, the largest single error category is coreference/entity confusion (27% of failures) [4], showing that confusing similar mentions is a major practical problem even outside strict EL.\n\nHow to test this in your setting, using patterns from the papers:\n1) Build a targeted test set\n   - Create a small benchmark that stresses acronyms and visually/lexically similar entities, analogous to how MATE is used to stress cross‑modal entity linking [1].\n   - Include positive pairs (correct acronym→entity) and hard negatives (different entities sharing acronym or near‑identical names).\n\n2) Run your full EL pipeline\n   - Use the same identifier / keyword / embedding‑based linking strategy described in MedKGent [3].\n\n3) Manually and/or LLM‑assisted evaluation\n   - Have humans or an LLM score whether each link is correct vs. incorrect, similar to the triple‑validity rubric used in MedKGent [3] and the SPARQL error audit in FIRESPARQL [2].\n\n4) Analyze error types\n   - Break down errors like the semantic vs. structural analysis in [2] and the error table in [4], but with categories "acronym collision," "lookalike surface form," etc.\n\nThis will give you an empirical error rate for your own stack, focused specifically on acronyms and lookalikes.",
"full_faithfulness": 0.24,
"seed_faithfulness": 0.8260869565000001,
"answer_relevancy": 0.0,
"context_relevance": 0.208549739206933,
"context_relevance_reason": "The context provided does not directly address the user's question about how often entity linking messes up acronyms and lookalikes, nor does it offer methods for testing that. While it discusses entity linking and its evolution, it lacks specific information on the issues related to acronyms and lookalikes, which is the core of the user's inquiry.",
"hallucination_score": 0.6572611409640697,
"hallucination_reason": "The response accurately identifies that the documents do not provide a specific frequency for how often entity linking fails with acronyms or lookalikes, which aligns with the input query. It also discusses relevant issues such as semantic inaccuracies and coreference confusion, which are pertinent to the topic. However, it lacks direct references to specific claims made in the context, such as the limitations of traditional EL methods or the role of exact keywords in medical KGs, which could have strengthened the response further.",
"full_contexts": ["Entity Linking\n\nEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "However, MEL does not establish cross-modal relations beyond these auxiliary associations, thereby limiting genuine cross-modal interaction.", "CMEL goes further by treating visual content as entities-aligning visual entities with their textual counterparts-to construct MMKGs and facilitate explicit crossmodal inference (Yao et al. 2023). Research on CMEL remains in its early stages, lacking a unified theoretical framework and robust evaluation protocols. The MATE benchmark is introduced to assess CMEL performance, but its synthetic 3D scenes fall short in capturing the complexity and diversity of real-world images (Alonso et al. 2025). To bridge this gap, we construct a CMEL dataset featuring greater real-world complexity and propose a spectral clustering-based method for candidate entity generation to drive further advances in CMEL research.", "3 Error type analysis on generated SPARQL queries\n\nDespite the improvements of LLMs on QA over SKGs, LLMs face limitations when handling KG-specific parsing. The experimental results conducted by Sören Auer et al.[2] showed that only 63 out of 100 handcrafted questions could be answered by ChatGPT, of which only 14 answers were correct. To better understand why LLMs fail to generate the correct SPARQL query to a NLQ, we conduct a pilot experiment on using ChatGPT(GPT-4) with a random one-shot example to generate SPARQL queries for 30 handcrafted questions in the SciQA benchmark datasets.", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "of questions, but often struggle with linking the correct properties and entities because LLMs do not learn the content of the underlying KG. Structural inconsistencies arise due to LLMs' lack of ontological schema of the underlying KG, leading to errors in query structure, such as missing or abundant links (triples), despite correctly linking to the mentioned entities or properties.", "Figure 1 shows the example of semantic inaccuracies and structural inconsistencies problem with the generated SPARQL queries in our pilot study. In the example of the semantic inaccuracies problem, ChatGPT failed to link the correct property orkgp:P15687; instead, it linked to a wrong property orkgp:P7101. In the example of the structural inconsistencies problem, the SPARQL query generated by ChatGPT directly links Contribution to Metrics, fails to detect the correct schema of the ORKG where Contribution and Metric are connected via Evaluation.", "Fig. 
1: Examples of semantic inaccuracies and structural inconsistencies problem with the generted SPARQL queries\n\nSemantic inaccuracies Problem\n\nFail to link the correct properties and entities in ORKG\n\nWhat is the maximum sample size?\n\nContribution Evaluation Metric P34 P2006 P7046\n\nStructural inconsistencies Problem\n\nMake errors in query structure, such as missing or abundant links (triples)\n\nWhat are the metrics used by paper \"Using NMF-based text summarization\n\nto improve supervised and unsupervised classification?\n\norkgp:P15687 rdfs:label Sample size (n)\n\norkgp:P7101 rdfs:label has elements", "2 Results\n2.1 Method overview\n\nas its confidence score. For instance, if the triple (NPPA, Negative Correlate, Water) appears in 90% of the outputs, its confidence score is 0.9. Low-confidence triples (score  75% recall at 10-session distance. Error bars denote standard deviation across three runs.\n\nFactual Recall vs. Session Depth (MultiWOZ-Long)\n\nStateless LLM Vector RAG Entity-RAG Semantic Anchoring Session Depth\n\n|---:|:-------------------|---------:|---------:|-----------:|\n| 0 | Stateless LLM | 49.8 | 44.1 | 2 |\n| 1 | Vector RAG | 68.7 | 62.5 | 3.2 |\n| 2 | Entity-RAG | 72.1 | 68.3 | 3.6 |\n| 3 | Semantic Anchoring | 81.4 | 77.9 | 4.2 |\n\nTable 2: Performance on DialogRE-L. Semantic Anchoring achieves consistent gains across metrics, suggesting effectiveness in relation extraction tasks that require long-range entity tracking.", "5.3 Ablation Studies\n\nTable 3 examines the role of linguistic components. Removing discourse tagging reduces FR by 4.7 points, while excluding coreference resolution reduces DC by 6.2 points. Eliminating all symbolic features collapses performance to Vector RAG levels. These results align with observed error patterns (§5.6), underscoring the value of symbolic features.", "5.4 Qualitative Examples\n\nIn MultiWOZ-Long, when the user later asks 'Did he confirm the time for the taxi?' , Semantic Anchoring retrieves:\n\n[Entity: John Smith][CorefID: E17] confirmed the taxi is booked for 9 AM.", "By contrast, Vector RAG surfaces unrelated mentions of 'taxi.' Additional examples, including cases where Semantic Anchoring fails, are shown in Appendix C.", "| | Variant | FR (%) | DC (%) | UCS (/5) |\n|---:|:-------------------------|---------:|---------:|-----------:|\n| 0 | Full Model | 83.5 | 80.8 | 4.3 |\n| 1 | - Discourse Tagging | 78.8 | 75.6 | 4 |\n| 2 | - Coreference Resolution | 80.1 | 74.6 | 4.1 |\n| 3 | - Dependency Parsing | 81.2 | 78.5 | 4.1 |\n| 4 | Dense-only (Vector RAG) | 71.6 | 66.4 | 3.4 |\n\nTable 3: Ablation results on MultiWOZ-Long. Removing discourse or coreference modules significantly reduces FR and DC, respectively. Without all symbolic features, performance falls to the dense-only baseline.", "5.5 Human Evaluation\n\nFive trained annotators rated 50 randomly sampled conversations for User Continuity Satisfaction (UCS). Agreement was high ( α = 0 . 81 ). As Table 1 shows, Semantic Anchoring achieves the highest UCS (4.3), with annotators noting better consistency in entity references. Full protocol details are in Appendix B.", "5.6 Error Analysis\n\nTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. 
While overall error frequency is lower than dense retrieval, these remain open challenges.", "| | Error Type | Proportion of Failures |\n|---:|:----------------------|:-------------------------|\n| 0 | Parsing errors | 19% |\n| 1 | Coreference mistakes | 27% |\n| 2 | Discourse mislabeling | 15% |\n| 3 | Other / miscellaneous | 39% |\n\nTable 4: Error analysis on MultiWOZ-Long. Coreference mistakes are the most frequent error type, followed by parsing and discourse issues. These patterns align with ablation results."],
"seed_texts": ["Entity Linking\n\nEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "We visualized a local subgraph of the constructed KG with COVID-19 as the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterized by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity in the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking within a query by matching entities to their corresponding nodes in the KG. When the Identifier of an entity in the query is available, entity linking can be efficiently performed using this unique reference. In the absence of an Identifier, precise matching", "Given a user query q , we first extract the set of medical entities present in the question, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a unique identifier. This allows for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } within the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of each node in the KG. A successful match enables linkage of the entity to the corresponding graph node. In both approaches, if an entity cannot be linked via its identifier or if its surface form does not appear in any node's Exact Keywords list, we apply a semantic similarity strategy to complete the entity linking process. Specifically, the embedding of the query entity is computed using the same model employed for generating node-level semantic representations ( i.e. , BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes in the KG. 
The entity is then linked to the node with the highest semantic similarity score, which may correspond to either the exact concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and flexible integration of KG-derived knowledge into downstream QA tasks.", "5.6 Error Analysis\n\nTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. While overall error frequency is lower than dense retrieval, these remain open challenges."]

Here we can see that the corpus didn’t actually contain direct evidence for what was asked, but the seed chunks were a smaller, more contained set, so faithfulness scored higher when the judge was only given the seed chunks.

Answer relevancy was also punished unusually hard, scoring 0 because the answer couldn’t address the question directly.

This one was an outlier, though; in general, full faithfulness is higher than seed faithfulness.

You can also check a good one to understand how it scores the ones that pass.

"question": "What retrieval settings help reduce needle-in-a-haystack issues for medical QA?",
"answer": "To reduce "needle‑in‑a‑haystack" problems in medical QA, the retrieval setup should focus on bringing back a small, very focused set of highly relevant items instead of large, noisy context.\n\n1. Control context size and granularity\n- Use abstracts or snippets rather than full articles; long, unfocused context diluted model performance in biomedical QA, so reranking down to a smaller, high‑quality subset was necessary for good answers [1].\n- Limit the number of documents passed to the model (e.g., rerank from thousands down to a few hundred or less) to avoid disorienting the model with too much text [1].\n\n2. Use multi‑stage retrieval\n- Start with high‑recall lexical retrieval (e.g., BM25 over PubMed) to get a broad set of candidates (up to thousands) [1].\n- Apply a strong semantic reranker to pick only the top slice for RAG (e.g., top 300) [1].\n- If evidence is missing, fall back to broader web‑or multi‑source search over trusted medical sites to fill gaps, rather than over‑stuffing the first pass with everything [1].\n\n3. Improve query quality\n- Use an LLM to rewrite or expand queries (synonyms, related terms, less rare jargon) when initial recall is poor, boosting the chance that the key paper is retrieved without over‑expanding every query by default [1].\n\n4. Use structured, corpus‑aware retrieval\n- Retrieve from structured medical knowledge sources such as knowledge graphs and prepend only the most relevant subgraph, which consistently improved accuracy across several medical QA benchmarks [3].\n- In multimodal/heterogeneous setups, use corpus‑specific retrieval for different knowledge sources so each corpus is searched with methods tuned to its content, instead of one generic retriever over everything [2].",
"full_faithfulness": 1.0,
"seed_faithfulness": 0.8636363636000001,
"answer_relevancy": 0.9135841092,
"context_relevance": 0.8976322813389481,
"context_relevance_reason": "The context passages provide comprehensive insights into retrieval settings that can mitigate needle-in-a-haystack issues in medical QA. Specifically, the discussion on the integration of LLMs for information retrieval, the use of semantic reranking, and the multi-stage retrieval approach directly addresses the user's question. The emphasis on maintaining relevance while expanding query coverage and the mention of ensemble methods highlight effective strategies for improving retrieval accuracy in complex biomedical queries. However, while the information is highly relevant, a more explicit connection to specific 'needle-in-a-haystack' challenges could enhance clarity.",
"hallucination_score": 0.8893376167284271,
"full_contexts": ["Abstract\n\nBiomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.", "3. Our methodology\n\n3.1. Information Retrieval Pipeline\n\nTo support high-quality RAG for Phase A+, we developed an IR pipeline that integrates traditional lexical search with LLM-based query generation and semantic reranking (Fig. 1).", "If the initial query returns fewer than five documents, we invoke Gemini 2.5 Pro Preview (05-06) to automatically revise the query. The model is prompted to enhance retrieval recall by enabling approximate matching and omitting overly rare or domain-specific terms. This refinement step is done to improve the query coverage while maintaining relevance. Our experiments have shown that this process is required in less than 5% of the queries in the BioASQ 13 test set.", "We index all PubMed article titles and abstracts in an Elasticsearch instance, using BM25 retrieval as the ranking function. For each input question, we use Gemini 2.0 Flash to generate a structured Elasticsearch query that captures the semantic intent of the question using synonyms, related terms, and full boolean query string syntax rules supported by Elasticsearch. This query is validated using regular expressions and then is used to retrieve up to 10,000 documents.", "Following document retrieval, we apply a semantic reranking model (Google semantic-ranker-default004) to reduce the number of candidate documents [11]. This model re-scores the initially retrieved documents based on semantic similarity to the original question, allowing us to select the top 300 most relevant documents. This reranked subset is used for downstream RAG-based QA, since despite really long context supported by modern Transformer architectures [12, 13], we could not get adequate QA results on full article abstracts without this step.", "Finally, we have added additional IR searches to handle the cases where a QA step does not return a response based on the evidence retrieved from Elasticsearch. 
We have observed that Elasticsearch context might not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending on the batch. An automated process is used to expand IR sources to address these cases. First, we are using a Google search restricted to PubMed sources to attempt to find new matches. If that fails, we extend our sources to include Home of the Office of Health Promotion and Disease Prevention, WebMD,\n\nThis multi-stage retrieval approach, combining LLM-generated queries, a traditional BM25 search, and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to complex biomedical queries.", "Figure 1: IR process\n\nPubMed corpus in Elasticsearch Query Generation (Gemini 2.0 Flash) Query Valida- tion and IR (BM25, ≤ 10,000 docs) Results 

Here, the extra (full) context actually contains the missing evidence, which is why it scored higher than the seed chunks.

Frankly, though, I still find the answer a bit lackluster; there’s nothing all that interesting in it. But that points to issues on the retrieval side.

Lastly, check a few random ones and see how it is evaluating these.

"question": "how do i measure if i\u2019m getting diverse sources instead of 10 chunks from the same pdf", 
"answer": "You can treat \u201cdiverse sources\u201d as something you measure directly, not just eyeball.\n\n1. Simple source-diversity score (like in the paper)\nIn Document 1, they define a retrieval diversity metric as the fraction of unique review IDs among the top\u201110 chunks: Diversity = (number of distinct IDs) / 10 [1].\nYou can copy this idea for documents:\n- Give every PDF (or document) a unique ID.\n- When you retrieve K chunks (say K=10), look at which document each chunk came from.\n- Compute: Diversity = (number of distinct document IDs) / K.\n  \u2022 If all 10 chunks come from the same PDF, Diversity = 1 / 10 = 0.1 (very low).\n  \u2022 If all 10 chunks come from different PDFs, Diversity = 10 / 10 = 1.0 (perfect diversity).\n\nYou can log this number for each query to monitor whether you\u2019re repeatedly pulling from the same file or actually spreading across your corpus.\n\n2. Combine with relevance\nIn Document 1, they also track how close the chunks are to the query using an average cosine similarity between the query embedding and each chunk [1]. In practice you want both:\n- High similarity (on\u2011topic).\n- High diversity (many different source IDs).\n\n3. Optional: richer diversity later\nDocument 2 shows that diversity can also be measured at different levels (lexical, semantic, ingredient) when you have multiple outputs for one input [2]. By analogy, once you\u2019re happy with source diversity, you could later add:\n- Semantic diversity between chunks (are they saying different things?).\nBut the easiest starting point is the per-query \u2018distinct document IDs / K\u2019 metric.",
"full_faithfulness": 0.8823529412000001, 
"seed_faithfulness": 0.5294117647000001,
"answer_relevancy": 0.8588673985, 
"context_relevance": 0.5303662744007874,
"context_relevance_reason": "The context passages provide relevant information about measuring diversity in retrieval systems, particularly in the context of recipe adaptation and LLMs. However, while some passages discuss diversity metrics and retrieval methods, they do not directly address the user's specific question about measuring diverse sources versus multiple chunks from the same PDF. The relevance of the context is somewhat indirect, leading to a moderate score.",
"hallucination_score": 0.7209711030557213,
"hallucination_reason": "The response effectively outlines a method for measuring source diversity by introducing a simple source-diversity score and providing a clear formula. It aligns well with the context, which discusses retrieval diversity metrics. However, while it mentions combining relevance with diversity, it does not explicitly connect this to the context's focus on average cosine similarity, which could enhance the completeness of the answer. Overall, the claims are mostly supported, with minor gaps in direct references to the context."
"full_context": ["D. Question and Answering (QA)\n\nFor retrieval of reviews, we sampled five Spotify-centric queries and retrieved the top K = 10 review chunks for each. We measured two unsupervised metrics:\n\nAverage Cosine Similarity : the mean cosine similarity between each query embedding and its top-10 chunk embeddings.", "Retrieval Diversity : the fraction of unique review IDs among all retrieved chunks (distinct IDs / 10).\n\nOur retriever achieved perfect diversity and cosine scores from 0.618 to 0.754, demonstrating reliable, on-topic retrieval. Table IX summarizes these proxy metrics.", "For generation of answers, we randomly sampled 20 generated answers (each paired with its cited snippets) and annotated them ourselves, confirming that each answer (1) reflected the cited excerpts, (2) covered the main points of those excerpts, and (3) was written in clear, reader-friendly prose. We found the responses to be accurate and comprehensive.", "| | Query | Avg. Cosine Sim. | Diversity |\n|---:|:-------------------------------------------------------------------------------|-------------------:|------------:|\n| 0 | What complaints do users have about | 0.713 | 1 |\n| 1 | What do listeners say about Spotify crashing or freezing on startup? | 0.754 | 1 |\n| 2 | How do listeners describe the app's offline playback experience? | 0.696 | 1 |\n| 3 | How do users report errors or failures when downloading songs for offline use? | 0.618 | 1 |\n| 4 | What do users say about Spotify's crossfade and track-transition experience? | 0.65 | 1 |\n\nTABLE IX RETRIEVAL PROXY METRICS (K=10) FOR SELECTED SPOTIFY QUERIES (HIGHER DIVERSITY IS BETTER)", "2 Related Work\n\nRecipe Cross-Cultural Adaptation Recipe cross-cultural adaptation (Cao et al., 2024) involves modifying recipes to suit the dietary preferences and writing styles of the target culture. This includes not just translation, but also adjusting formats, ingredients, and cooking methods to align with cultural norms. Previous studies (Cao et al., 2024; Pandey et al., 2025; Zhang et al., 2024) often treat recipe adaptation as a cross-cultural translation task, exploring how prompt-based LLMs can be used for Chinese-English recipe adaptation.", "However, LLM-based recipe adaptation still faces challenges. Magomere et al.'s (2024) show that such methods can be misleading and may reinforce regional stereotypes. Hu et al.'s (2024) further identify two main challenges: First, LLMs lack culinary cultural knowledge, leading to insufficient cultural appropriateness. Second, the adapted recipes have quality issues, such as changing ingredients without adjusting the cooking steps accordingly. They propose another way to address these issues, namely through cross-cultural recipe retrieval, which sources recipes from real cooking practices within the target culture, generally offering better quality and cultural alignment. However, compared to directly using LLMs, the retrieved recipes often have low similarity to the original.", "All the above-mentioned studies primarily focus on the quality of generated results, including cultural appropriateness and their preservation of the original . However, they overlook the diversity of the results and do not explore the use of RAG for cross-cultural recipe adaptation. 
Our study emphasizes the trade-off between diversity and quality, with a particular focus on RAG-based approaches.", "Diversity in text generation, IR, and RAG Previous studies (Lanchantin et al., 2025) have shown that post-training LLMs tend to sharpen their output probability distribution, leading to reduced response diversity. This has raised a common concern about the insufficient diversity of LLMs, particularly in creative tasks. Several stochastic sampling-based decoding methods are widely used to control the level of diversity, most notably by adjusting hyperparameters such as temperature (Shi et al., 2024). However, these methods often still fall short in achieving sufficient diversity and may lead to a rapid decline in output quality, which is another important factor to consider when measuring diversity (Lanchantin et al., 2025).", "Figure 2: Overview of CARRIAGE . Diversity components are highlighted. We first enhance the diversity of retrieved results, then we enable more diverse use of contextual information via dynamic context selection, and inject contrastive context to prevent the LLM from generating outputs similar to previously generated recipes.\n\nMulti-Query Retrieval Source Culture Recipe Target Culture Diversity-aware Reranking Query Rewriting Dynamic Context Organization Pool of Previously Generated Recipes LLM Generation Contrastive Context Injection Previously : Diversity component Reference Recipes Selection Relevance Diversity\n\nMay generate multiple times\n\nIn IR, retrieving text with high diversity can cover a wider range of subtopics, thereby accommodating the potentially diverse preferences of different users. Methods such as diverse query rewriting (Mohankumar et al., 2021) and diversity-aware re-ranking (Carbonell and Goldstein, 1998; Krestel and Fankhauser, 2012) can effectively enhance the diversity of retrieval results. Some recent works (Carraro and Bridge, 2024) have explored using LLMs to enhance diversity in re-ranking.", "In RAG, prior works have mainly focused on retrieving diverse results to obtain more comprehensive information, such as mitigating context window limitations (Wang et al., 2025) and addressing multi-hop question answering tasks (Rezaei and Dieng, 2025). These works are primarily framed as question answering, aiming to acquire comprehensive knowledge to produce a single correct answer. Consequently, the evaluation metrics emphasize answer accuracy rather than diversity. In contrast, our task naturally permits multiple valid answers. Therefore, we adopt different strategies to encourage answer diversity and use metrics that explicitly evaluate the diversity of final outputs. While prior works have largely focused on retrieving diverse contexts, our approach goes a step further by investigating how to utilize such diverse contexts to produce diverse outputs.", "5 Metrics\n\nOur evaluation metrics focus on two key aspects: diversity and quality . To assess diversity, we consider factors such as lexical , semantic , and ingredient diversity from a per-input perspective. As a trade-off, we evaluate quality from two dimensions: the preservation of the source recipe, and cultural appropriateness for users in the target culture.", "5.1 Diversity\n\nKirk et al.'s (2023) have proposed two paradigms for measuring diversity: across-input (over pairs of one input and one output) and per-input diversity (one input, several outputs). 
Per-input diversity helps us investigate whether a single recipe can be adapted into multiple variants to meet different dietary preferences, while across-input diversity assesses whether the generated recipes collectively exhibit a diverse range of linguistic patterns. Because our investigation primarily focuses on whether a single recipe can be adapted into diverse variations to meet a broader range of needs, we adopt the per-input diversity setting as our main experimental focus. The across-input diversity setting is discussed further in Section 7.", "For a diversity metric D , under model configuration c , A denotes a set of adapted recipes,", "containing N source recipes, we define A i c = { a i c, 1 , a i c, 2 , . . . , a i c,K } as the set of K adaptations for the i -th source recipe under configuration c . The per-input diversity is defined as follows:\n\nLexical Diversity Lexical diversity is a measure of the variety of vocabulary used within a set of text. High lexical diversity indicates using a broad range of unique words, which may correspond to a wider variety of ingredients, cooking methods, and flavors. We employ Unique-n (Johnson, 1944) to evaluate lexical diversity, calculated as the ratio of unique n -grams to the total number of n -grams, reflecting the proportion of distinct n -grams and indicates vocabulary richness. Following prior work (Guo et al., 2024), we report the average Unique-n across unigrams, bigrams, and trigrams.", "Semantic Diversity Semantic diversity refers to the variety of meanings within a set of texts. High semantic diversity suggests a wide range of culinary ideas. We measure per-input semantic diversity using the average pairwise cosine distance between Sentence-BERT embeddings because embedding-based semantic diversity enables a more fine-grained evaluation of variation beyond surface-level vocabulary (Stasaski and Hearst, 2023). Specifically, for a set of K adapted recipes, we define the sum of their average semantic similarity and semantic diversity to be 1. In this formulation, higher semantic similarity implies lower semantic diversity. We define semantic diversity, scaled to the range [0 , 1] , as follows:\n\nwhere e represents embeddings of the recipe.", "Ingredient Diversity Ingredient diversity measures the variation in sets of ingredients across different recipes. Ingredient choice plays a crucial role in recipe diversity (Borghini, 2015). Compared to general lexical variation, ingredient changes offer a more precise signal for capturing the key factors driving diversity in recipes.", "Recipes often describe the same ingredient in varying ways, such as differences in quantity or units of measurement. To mitigate this, we introduce Standard Ingredients , which retain only the ingredient name by stripping away non-essential details. Since ingredient descriptions typically follow the format   , we extract only the  to compute ingredient diversity. The detailed procedure is provided in Appendix B.", "To avoid the influence of differing ingredient counts across recipes, we define ingredient diversity as the ratio of unique standardized ingredients to the total number of ingredients. For a set of K adapted recipes, let the set of standardized ingredients for each recipe be I 1 , I 2 , . . . , I K . We define ingredient diversity as follows:", "5.2 Quality\n\nWe define automatic quality metrics to serve as a trade-off when evaluating recipe diversity. 
Further details on the training and evaluation of the CultureScore model are provided in Appendix B.", "Source Recipe Preservation Following prior work (Cao et al., 2024; Hu et al., 2024), we employ BERTScore (Zhang* et al., 2020), a common cosine embedding-based method for measuring the similarity between source and output recipes. Previous studies have shown that BERTScore aligns well with human evaluations in terms of source recipe preservation (Hu et al., 2024).", "Cultural Appropriateness We propose a novel metric, the Recipe Cultural Appropriateness Score (CultureScore), to assess how well the output recipes align with the target culture. Specifically, we employ a BERT-based classifier (Devlin et al., 2019; Ca\u00f1ete et al., 2020) to predict the country of origin of a recipe using its title and list of ingredients as input. The CultureScore is defined as the average predicted probability assigned by the model to the target culture across all adapted recipes, with higher scores indicating better cultural alignment. Since Latin American and Spanish recipes share the same language, the model cannot rely on linguistic cues; instead, it must learn to distinguish them based on culturally relevant features such as ingredients, flavors, and writing styles. Given that the classification model achieves an F1-score of over 90% in distinguishing between Latin American and Spanish recipes, we consider CultureScore a reliable proxy for assessing cultural appropriateness.", "| | | Method. | Diversity ( \u2191 ).Lexical | Diversity ( \u2191 ).Ingredient | Diversity ( \u2191 ).Semantic | Quality ( \u2191 ).CultureScore | Quality ( \u2191 ).BERTScore |\n|---:|:------------------|:----------------------------------------------------------------------------|:--------------------------|:-----------------------------|:---------------------------|:-----------------------------|:--------------------------|\n| 0 | Closed- Book LLMs | Llama3.1-8B Qwen2.5-7B Gemma2-9B | 0.557 0.551 0.538 | 0.667 0.531 0.639 | 0.232 0.247 0.196 | 0.451 0.404 0.468 | 0.404 0.439 0.370 |\n| 1 | IR | JINA-ES CARROT CARROT-MMR | 0.742 0.735 0.741 | 0.937 0.925 0.941 | 0.459 0.462 0.527 | 0.511 0.512 0.503 | 0.295 0.301 0.298 |\n| 2 | RAG | Vanilla-LLaMA RAG CARROT-LLaMA RAG CARROT-MMR-LLaMA RAG CARROT-MMR-Qwen RAG | 0.518 0.525 0.520 0.532 | 0.748 0.765 0.748 0.536 | 0.155 0.152 0.164 0.212 | 0.383 0.385 0.393 0.402 | 0.551 0.545 0.545 0.448 |\n| 3 | Ours | CARRIAGE -LLaMA CARRIAGE -Qwen | 0.577 0.628 | 0.739 0.676 | 0.269 0.303 | 0.463 0.590 | 0.442 0.342 |", "Table 1: Evaluation of diversity and quality on the RecetasDeLaAbuel@ dataset shows that our proposed CARRIAGE -LLaMA outperforms all closed-book LLMs in terms of Pareto efficiency across both diversity and quality metrics. In contrast, IR-based methods struggle with preserving the source recipe, while other RAG-based approaches tend to underperform in terms of diversity and cultural appropriateness."

This one above is interesting: you can see the evaluator taking a reasonable generalization and treating it as “kinda supported” or “meh.”

Evaluating this item above with another LLM, it said that it thought the context relevance comment was a bit whiny.

But as you see, low scores don’t have to mean that the system is bad. You have to examine why they are low and also why they are high to understand how the judge works or why the pipeline is failing. 

A good example is context relevance here. Context relevance measures how much of the retrieved context was useful. If you’re doing neighbor expansion, you will almost always pull in some irrelevant text, so context relevance will look worse, especially if the corpus can’t answer the question in the first place.

The question is whether the extra context actually helps grounding (faithfulness / hallucination rate) enough to be worth the noise.
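One way to put a number on that trade-off is to look at the per-question delta between full and seed faithfulness instead of only the averages. A tiny sketch, assuming the eval records carry both scores:

import statistics

def expansion_gain(rows: list[dict]) -> None:
    # rows are per-question eval records with both faithfulness scores.
    deltas = [r["full_faithfulness"] - r["seed_faithfulness"] for r in rows]
    print("mean gain from expansion:", round(statistics.mean(deltas), 3))
    print("helped on", sum(d > 0 for d in deltas), "of", len(deltas), "questions")
    print("hurt on  ", sum(d < 0 for d in deltas), "questions")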

Some cautious notes

Okay, some notes before I round this off. 

Testing seeds here is clearly biased, and it doesn’t tell us whether they were actually useful on their own. We’d have to build two different pipelines and compare them side by side to say that properly.

I’ll try to do this in the future, with this exact use case.

I should also note that the system has very few docs in the pipeline: only about 150 PDF files along with some Excel files, which is a few thousand pages. But I have to demo this in public, and this was the only way.

Remember, we only used metrics on the generation side here, looking at the context that was retrieved. If the retrieved context is wrong or contains conflicting information, these metrics may not show it; you have to measure that earlier, on the retrieval side.

Furthermore, many teams build their own custom metrics, unique to their pipeline and to what they want to test. Even if you start like this, with general ones, you’ll spot what you need along the way to build better-targeted ones.

The last thing to note is LLM judge bias. I’m using OpenAI models both for the RAG pipeline and for the evaluator. This is generally not recommended, but as long as the judge model is different from the generator model, it’s usually accepted.


Hopefully it was a fun read (if you’re a dork about data like me).

Stay tuned for the last article where I try to test a more naive pipeline against this one (hopefully I have time to finish it).

If you want to stay updated or just connect, you’ll find me on LinkedIn, my website, or Medium (and here too).
