If you are using a modern vector database—Neo4j, Milvus, Weaviate, Qdrant, Pinecone—there is a very high chance that Hierarchical Navigable Small World (HNSW) is already powering your retrieval layer. It is quite likely you did not choose it while building the database, nor did you tune it or even know it is there. And yet, HNSW is quietly deciding what your LLM sees as truth. It determines which document chunks are fed into your RAG pipeline, which memories your agent recalls, and ultimately, whether the model answers correctly—or hallucinates confidently.
As your vector database grows, retrieval quality degrades gradually:
- No exceptions are raised
- No errors are logged
- Latency often looks perfectly fine
But the context quality deteriorates, and your RAG system becomes less reliable over time—even though the embedding model and distance metric remain unchanged.
In this article, I demonstrate—using controlled experiments and real data—how HNSW impacts retrieval quality as database size grows, why this degradation is worse than flat search, and what you can realistically do about it in production RAG systems.
Specifically, I will:
- Build a practical, reproducible use case to measure the effect of HNSW on RAG retrieval quality using Recall@k.
- Show that, for fixed HNSW settings, recall degrades faster than flat search as the corpus grows.
- Discuss practical tuning strategies for balancing recall and latency beyond simply increasing ef_search of HNSW.
What is HNSW?
HNSW is a graph-based algorithm for Approximate Nearest Neighbor (ANN) search. It organizes data into multiple layers of connected neighbors and uses this graph structure to speed up search.

Each vector is connected to a limited number of neighbors in each layer. During a search, HNSW performs a greedy traversal through these layers, checking only a bounded number of neighbors at each step (controlled by M and ef_search), which makes search time grow roughly logarithmically with the number of vectors. Compared to flat search, which has O(N) time complexity, HNSW search is approximately O(log N): the time required for a search grows very slowly as the database grows. We will see this in the results of our use case.
Parameters of HNSW index
1. Build-time parameters: M and ef_construction. These can only be set when the index is built.
M defines the maximum number of connections (neighbors) that each vector (node) can have in each layer of the graph. A higher M means more connections, making the graph denser and potentially increasing recall but at the cost of more memory and slower indexing.
ef_construction controls the size of the candidate set used during construction of the graph. Essentially, it governs how thoroughly the graph is built during indexing. A higher ef_construction means more candidates are considered before making each connection, which leads to a higher-quality graph and better recall, at the cost of increased memory and slower indexing.
For a general-purpose RAG application, typical values of M are in the range of 12 to 48, and ef_construction between 64 and 200.
2. Query-time parameter: ef_search
This defines the number of candidate nodes (or vectors) to explore during the query process (i.e., during the search for nearest neighbors). It controls how thorough the search process is by determining how many candidates are evaluated before the search result is returned. A higher value for ef_search means the search will explore more candidates, leading to better recall but potentially slower queries.
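As a minimal sketch of where these parameters live in code (using FAISS, which the experiments below also use; the data and parameter values here are purely illustrative):
import numpy as np
import faiss

dimension = 512                                   # illustrative embedding size
vectors = np.random.rand(10_000, dimension).astype("float32")

# Build-time parameters: M is passed to the constructor, and
# efConstruction must be set before vectors are added.
index = faiss.IndexHNSWFlat(dimension, 16)        # M = 16
index.hnsw.efConstruction = 100
index.add(vectors)

# Query-time parameter: efSearch can be changed at any time,
# without rebuilding the index.
index.hnsw.efSearch = 40
query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, 5)           # top-5 nearest neighbors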
What is Recall@k?
Recall@k is a key metric for measuring the accuracy of vector search and RAG systems. It measures the ability of the retriever to find the relevant chunks for a user query within the top k results. It matters because if the retriever misses the chunks containing the information required to answer the question (low recall), the LLM cannot possibly generate an accurate answer in the response-synthesis step, regardless of how powerful it is.
\[ \text{Recall}@k = \frac{\text{relevant items retrieved in top } k}{\text{total number of relevant items in the corpus}} \]
In practice, this is a difficult metric to measure because the denominator (the set of ground-truth documents) is not easily known for a real-life production system. What we will do instead is design a use case where the ground truth (a single vector index per query) is unique and known, and Recall@k will measure how often it is retrieved in the top-k results, averaged over a large number of sample queries.
For instance, Recall@5 will measure the fraction of 500 queries for which the ground-truth index appeared in the top-5 retrievals.
For a RAG system, an acceptable Recall@5 is typically 70-90% and Recall@10 is 80-95%, and we will see that our use case adheres to these ranges for the Flat index.
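As a minimal sketch of how Recall@k is computed in this single-ground-truth setting (the function and the example values are illustrative):
# Recall@k when each query has exactly one known ground-truth index,
# as in the experiments below.
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    # retrieved_ids: one ranked list of ids per query
    # ground_truth_ids: the single correct id for each query
    hits = sum(
        1 for ranked, truth in zip(retrieved_ids, ground_truth_ids)
        if truth in ranked[:k]
    )
    return hits / len(ground_truth_ids)

# Example: 3 queries with ground-truth ids 7, 2 and 9
retrieved = [[7, 4, 1], [5, 3, 2], [8, 6, 0]]
truths = [7, 2, 9]
print(recall_at_k(retrieved, truths, 3))   # 2 of 3 hits -> 0.67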
Use Case
To test HNSW, we need a vector database with a sufficiently large number of vectors (> 100,000). There does not seem to be a public dataset of that size consisting of document chunks and associated queries for which a particular chunk is the known ground truth. And even if there were, natural language can be ambiguous, so it is difficult to say with confidence which chunks in the corpus are relevant for a query (the denominator in the Recall@k formula). Building such a curated dataset would require collecting a large number of documents, chunking and embedding them, and then writing queries for the chunks, which would be a resource-intensive process.
Instead, let's re-imagine our RAG problem as “given a short caption (query), we would like to retrieve the most relevant images from the dataset”.
For this approach, I used the publicly available LAION-Aesthetics dataset. To access it, you will need to be logged in to Hugging Face and agree to the stated terms. Details about the dataset are available on the LAION site. It contains a huge number of rows, each with an image URL and a text caption. They look like the following:

I downloaded a subset of rows and generated 200,000 CLIP embeddings of the images to build the vector database. The text captions of the images can conveniently be used as RAG queries, and each caption has exactly one image vector as its ground truth, so the denominator of Recall@k is exactly known for every query. Also, the CLIP embeddings of an image and its caption are never an exact match, so there is enough “fuzziness” in the retrievals, similar to a pure document RAG, where a text query retrieves relevant document chunks using a distance metric. This will be evident in the Recall@k charts in the next sections.
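For reference, here is a minimal sketch of how such CLIP embeddings can be generated with open_clip. The model name, checkpoint and file name are illustrative assumptions, not necessarily the exact ones used for the 200k-vector database:
import torch
import open_clip
from PIL import Image

# Model/checkpoint names are illustrative assumptions
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # illustrative file name
text = tokenizer(["A cozy cabin in the snowy mountains"])

with torch.no_grad():
    image_emb = model.encode_image(image)    # vector stored in the database
    text_emb = model.encode_text(text)       # vector used as the query
    # Normalizing makes L2 distance behave consistently with cosine similarity
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)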
Measuring Recall@k for Flat vs HNSW
We adopt the following approach:
- Embeddings of 200k images are stored as .npy file.
- From the LAION dataset, 500 captions (queries) are randomly selected and embedded using CLIP. The selected indices also form the ground truth, since each corresponds to the unique image for its query.
- The database is built in increments of 50,000 vectors, so 4 iterations of size 50k, 100k, 150k and 200k vectors. Both flat and HNSW indexes are built. HNSW is built using M=16 and ef_construction=100.
- Recall@k is calculated for k = 1, 5, 10, 15 and 20, based on whether the ground-truth index is included in the top-k results.
- First, the Recall@k values are calculated for each of the query vectors and averaged over the number of samples (500).
- Then, average Recall@k values are calculated for HNSW ef_search values of 10, 20, 40, 80 and 160.
- Finally, 5 charts are drawn, one for each of the Recall@k values. Each chart depicts the evolution of Recall@k as database size grows for Flat index and different ef_search values of HNSW.
The code is shown below:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt
def evaluate_subset(size, embeddings_all, df_all, query_vectors_all, eval_indices_all, ef_search_values):
    # Subset the embeddings to the current database size
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]
    # Build both indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)
    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)          # M = 16
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)
    num_samples = len(eval_indices_all)
    results = []
    ks = [1, 5, 10, 15, 20]
    # Evaluate Flat
    flat_recalls = {k: 0 for k in ks}
    for i, qv in enumerate(query_vectors_all):
        _, I = index_flat.search(qv, max(ks))
        target = eval_indices_all[i]
        for k in ks:
            if target in I[0][:k]:
                flat_recalls[k] += 1
    flat_res = {"Setting": "Flat"}
    for k in ks:
        flat_res[f"R@{k}"] = flat_recalls[k] / num_samples
    results.append(flat_res)
    # Evaluate HNSW with different efSearch values
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef
        hnsw_recalls = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            _, I = index_hnsw.search(qv, max(ks))
            target = eval_indices_all[i]
            for k in ks:
                if target in I[0][:k]:
                    hnsw_recalls[k] += 1
        hnsw_res = {"Setting": f"HNSW (ef={ef})", "ef": ef}
        for k in ks:
            hnsw_res[f"R@{k}"] = hnsw_recalls[k] / num_samples
        results.append(hnsw_res)
    return results

def format_table(size, results):
    # Render one fixed-width results table per database size
    # (the exact formatting below is illustrative)
    ks = [1, 5, 10, 15, 20]
    lines = []
    lines.append(f"\nDatabase Size: {size}")
    lines.append("=" * 80)
    header = f"{'Index/efSearch':<20}" + "".join(f"{'R@' + str(k):>12}" for k in ks)
    lines.append(header)
    for res in results:
        row = f"{res['Setting']:<20}" + "".join(f"{res['R@' + str(k)]:>12.4f}" for k in ks)
        lines.append(row)
    return "\n".join(lines)
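The evaluation can then be driven by a simple loop over the database sizes, roughly as follows. The file names and query-loading details are assumptions for illustration; the 500 caption embeddings and their matching image indices come from the steps listed above:
# Illustrative driver (file names are assumptions)
embeddings_all = np.load("laion_image_embeddings_200k.npy").astype("float32")
query_vectors_all = np.load("laion_caption_embeddings_500.npy").astype("float32")
query_vectors_all = [q.reshape(1, -1) for q in query_vectors_all]   # FAISS expects 2D queries
eval_indices_all = np.load("laion_query_ground_truth_indices.npy")  # image index per caption
df_all = None   # DataFrame of captions/URLs; not used inside evaluate_subset

ef_search_values = [10, 20, 40, 80, 160]
db_sizes = [50_000, 100_000, 150_000, 200_000]

all_results = {}
for size in db_sizes:
    all_results[size] = evaluate_subset(
        size, embeddings_all, df_all, query_vectors_all,
        eval_indices_all, ef_search_values
    )
    print(format_table(size, all_results[size]))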
And the results are the following:


Observations
- For the Flat index (dotted line), Recall@5 and Recall@10 are in the range of 0.70 – 0.85, as can be expected of real-life RAG applications.
- Flat index provides the best Recall@k across all database sizes and forms a benchmark upper bound for HNSW.
- At any given database size, Recall@k increases for a higher k. So for database size of 100k vectors, Recall@20 > Recall@15 > Recall@10 > Recall@5 > Recall@1. This is understandable as with a higher k, there is more probability that the ground truth index is present in the retrieved set.
- Both Flat and HNSW deteriorate consistently as the database size grows. This is because high-dimensional vector spaces become increasingly crowded as the number of vectors grows.
- HNSW performance improves with higher ef_search values.
- As the database size approaches 200k, HNSW appears to degrade faster than Flat search.
Does HNSW degrade faster than Flat Search?
To view the relative performance of Flat vs HNSW indexes as database size grows, a slightly different approach is adopted:
- The index construction and query selection process remains the same as before.
- Instead of considering the ground truth, we calculate the overlap between the Flat index results and each of the HNSW ef_search results for a given retrieval count (k).
- Five charts are drawn, one for each value of k, showing how the overlap evolves as the database size grows. For a perfect match with the Flat index, the HNSW line will show a score of 1. More importantly, if HNSW results degrade faster than the Flat index, the line will have a negative slope; otherwise it will be horizontal or have a positive slope.
The code is shown below:
import pandas as pd
import numpy as np
import faiss
import torch
import open_clip
import os
import random
import matplotlib.pyplot as plt
import time
def evaluate_subset_compare(size, embeddings_all, df_all, query_vectors_all, ef_search_values):
    # Subset the embeddings to the current database size
    embeddings = embeddings_all[:size]
    dimension = embeddings.shape[1]
    # Build both indices in-memory for this subset size
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(embeddings)
    index_hnsw = faiss.IndexHNSWFlat(dimension, 16)          # M = 16
    index_hnsw.hnsw.efConstruction = 100
    index_hnsw.add(embeddings)
    num_samples = len(query_vectors_all)
    results = []
    ks = [1, 5, 10, 15, 20]
    max_k = max(ks)
    # 1. Evaluate Flat once for this subset
    flat_times = []
    flat_results_all = []
    for qv in query_vectors_all:
        start_t = time.perf_counter()
        _, I_flat_all = index_flat.search(qv, max_k)
        flat_times.append(time.perf_counter() - start_t)
        flat_results_all.append(I_flat_all[0])
    avg_flat_time_ms = (sum(flat_times) / num_samples) * 1000
    # 2. Evaluate HNSW relative to Flat
    for ef in ef_search_values:
        index_hnsw.hnsw.efSearch = ef
        hnsw_times = []
        # Track intersection counts for each k
        overlap_counts = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            # HNSW top-max_k
            start_t = time.perf_counter()
            _, I_hnsw_all = index_hnsw.search(qv, max_k)
            hnsw_times.append(time.perf_counter() - start_t)
            # Flat result was already pre-calculated
            I_flat_all = flat_results_all[i]
            for k in ks:
                set_flat = set(I_flat_all[:k])
                set_hnsw = set(I_hnsw_all[0][:k])
                intersection = set_flat.intersection(set_hnsw)
                overlap_counts[k] += len(intersection) / k
        avg_hnsw_time_ms = (sum(hnsw_times) / num_samples) * 1000
        hnsw_res = {
            "Setting": f"HNSW (ef={ef})",
            "ef": ef,
            "FlatTime_ms": avg_flat_time_ms,
            "HNSWTime_ms": avg_hnsw_time_ms,
        }
        for k in ks:
            # Average the per-query overlap over all queries
            hnsw_res[f"R@{k}"] = overlap_counts[k] / num_samples
        results.append(hnsw_res)
    return results

def format_all_tables(db_sizes, ef_search_values, all_results):
    # One table per k: rows are ef_search values, columns are database sizes
    # (the exact formatting below is illustrative)
    ks = [1, 5, 10, 15, 20]
    lines = []
    for k in ks:
        k_label = f"R@{k}"
        lines.append(f"\nTable: {k_label} (HNSW Overlap with Flat)")
        lines.append("=" * (20 + 12 * len(db_sizes)))
        header = f"{'ef_search':<20}" + "".join(f"{size:>12,}" for size in db_sizes)
        lines.append(header)
        for ef in ef_search_values:
            row = f"{ef:<20}"
            for size in db_sizes:
                res = next(r for r in all_results[size] if r.get("ef") == ef)
                row += f"{res[k_label]:>12.4f}"
            lines.append(row)
    return "\n".join(lines)
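The overlap results can then be collected and turned into one chart per k with a loop like the following. The plotting details are my own sketch, not necessarily the exact code behind the figures; it assumes embeddings_all, query_vectors_all and df_all are prepared as in the first experiment:
# Sketch: collect overlap results and plot one chart per k
db_sizes = [50_000, 100_000, 150_000, 200_000]
ef_search_values = [10, 20, 40, 80, 160]
ks = [1, 5, 10, 15, 20]

all_results = {
    size: evaluate_subset_compare(size, embeddings_all, df_all,
                                  query_vectors_all, ef_search_values)
    for size in db_sizes
}

for k in ks:
    plt.figure()
    for ef in ef_search_values:
        overlaps = [
            next(r for r in all_results[size] if r["ef"] == ef)[f"R@{k}"]
            for size in db_sizes
        ]
        plt.plot(db_sizes, overlaps, marker="o", label=f"ef_search={ef}")
    plt.axhline(1.0, linestyle="--", label="Flat (perfect overlap)")
    plt.xlabel("Database size (number of vectors)")
    plt.ylabel(f"Overlap with Flat top-{k}")
    plt.title(f"HNSW overlap with Flat search, k={k}")
    plt.legend()
    plt.show()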
And the results are the following:


Observations
- In all cases, the lines have a negative slope, indicating that HNSW degrades faster than the Flat index as the database grows.
- Higher ef_search values degrade more slowly than lower ones, which fall quite sharply.
- Higher ef_search values maintain significant overlap (>90%) with the benchmark Flat search, compared to the lower values.
Recall-latency trade-off
We know that HNSW is faster than Flat search. To see it in action, I have also measured the average latency in the code of the previous section. Here are the average search times (in ms):
| Index / Database size | 50,000 | 100,000 | 150,000 | 200,000 |
| --- | --- | --- | --- | --- |
| Flat Index | 5.1440 | 9.3850 | 14.8843 | 18.4100 |
| HNSW (ef=10) | 0.0851 | 0.0742 | 0.0763 | 0.0768 |
| HNSW (ef=20) | 0.1159 | 0.0876 | 0.0959 | 0.0983 |
| HNSW (ef=40) | 0.1585 | 0.1366 | 0.1415 | 0.1493 |
| HNSW (ef=80) | 0.2508 | 0.2262 | 0.2398 | 0.2417 |
| HNSW (ef=160) | 0.4613 | 0.3992 | 0.4140 | 0.4064 |
Observations
- HNSW is orders of magnitude faster than flat search, which is the primary reason it is the search algorithm of choice for almost all vector databases.
- The time taken by Flat search increases almost linearly with database size (O(N) complexity).
- For a given ef_search value (a row), HNSW latency remains nearly constant at this scale (up to 200k vectors).
- As ef_search increases down a column, HNSW search time increases significantly; for instance, ef=160 takes roughly 3X as long as ef=40.
Tuning the RAG pipeline
The above analysis shows that while HNSW is definitely the option to adopt in a production scenario for latency reasons, ef_search needs to be tuned periodically to maintain the latency-recall balance as the database grows. Some best practices are as follows:
- Given the difficulty of measuring Recall@k in a production database, keep a test-case repository of queries with known ground-truth document chunks, which can be run at regular intervals to check retrieval quality. A good starting point is the most frequent user queries and the chunks needed to answer them.
- Another indirect way to ascertain recall quality would be to use a powerful LLM to judge the quality of the retrieved context. Instead of asking “Did we get the very best documents for the user query?”, which is difficult to say precisely for a large database, we can ask a slightly weaker question “Does the retrieved context actually contain the answer to the user’s question?” and let the judge LLM respond to that.
- Collect user feedback in production. User rating of a response along with any manual correction can be used as a trigger for performance tuning.
- While tuning ef_search, start with a conservatively high value, measure Recall@k, then reduce it until latency is acceptable (a minimal sketch of such a tuning loop follows this list).
- Measure Recall at the top_k that the RAG uses, usually between 3 and 10. Consider relaxing top_k to 15 or 20 and let the LLM decide which chunks in the given context to use for the response during synthesis step. Assuming the context does not become too large to fit in the LLM’s context window, such an approach would enable a high recall while having a moderate ef_search value, thereby keeping latency low.
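Here is a minimal sketch of such an ef_search tuning loop, using overlap with a Flat baseline as the recall proxy. The candidate ef_search values and the overlap threshold are illustrative assumptions; the indexes are assumed to be FAISS indexes as in the experiments above:
import time
import numpy as np

def tune_ef_search(index_hnsw, index_flat, queries, top_k=10, min_overlap=0.95):
    # Walk efSearch down from a conservatively high value and report overlap
    # with the Flat baseline plus average latency for each setting, so the
    # lowest efSearch that still meets the recall target can be picked.
    report = []
    for ef in [256, 128, 64, 32, 16]:              # high -> low (illustrative values)
        index_hnsw.hnsw.efSearch = ef
        overlaps, latencies = [], []
        for q in queries:                          # each q has shape (1, d)
            _, flat_ids = index_flat.search(q, top_k)
            start = time.perf_counter()
            _, hnsw_ids = index_hnsw.search(q, top_k)
            latencies.append((time.perf_counter() - start) * 1000)
            overlaps.append(len(set(flat_ids[0]) & set(hnsw_ids[0])) / top_k)
        report.append({
            "efSearch": ef,
            "overlap_with_flat": float(np.mean(overlaps)),
            "avg_latency_ms": float(np.mean(latencies)),
            "meets_recall_target": float(np.mean(overlaps)) >= min_overlap,
        })
    return report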
Hybrid RAG pipeline
Beyond a point, HNSW tuning using ef_search cannot fix the issue of falling recall as the database grows. That is because vector search, even with a Flat index, becomes noisy when too many vectors are packed close together in the N-dimensional space (N being the number of dimensions output by the embedding model). As the charts in the section above show, recall drops by more than 10% as the database grows from 50k to 200k vectors. The reliable way to maintain recall is to use metadata filtering (e.g., via a knowledge graph) to identify candidate document ids and run vector retrieval only over those, as sketched below. I discuss this in detail in my article GraphRAG in Practice: How to Build Cost-Efficient, High-Recall Retrieval Systems.
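Below is a minimal sketch of this pre-filtering idea, assuming a simple in-memory metadata list. In practice the filter could be a SQL predicate, a graph traversal, or an inverted-index lookup, and the subset index would typically be cached rather than rebuilt per query:
import numpy as np
import faiss

def filtered_search(query_vector, embeddings_all, metadata, predicate, top_k=10):
    # Hypothetical pre-filtered retrieval: metadata narrows the candidate ids,
    # then exact (Flat) search runs only over that subset.
    # 1. Metadata filter: a stand-in for a SQL / graph / inverted-index lookup
    allowed_ids = np.array(
        [i for i, meta in enumerate(metadata) if predicate(meta)], dtype="int64"
    )
    # 2. Exact search over the (much smaller) filtered subset
    #    (rebuilding per query is for illustration only; cache it in practice)
    sub_index = faiss.IndexFlatL2(embeddings_all.shape[1])
    sub_index.add(embeddings_all[allowed_ids])
    _, local_ids = sub_index.search(query_vector, min(top_k, len(allowed_ids)))
    # 3. Map positions within the subset back to ids in the full database
    return allowed_ids[local_ids[0]]

# Example (hypothetical metadata): restrict to one source before vector search
# top_ids = filtered_search(q, embeddings_all, metadata,
#                           predicate=lambda m: m["source"] == "finance_reports")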
Key Takeaways
- HNSW is the default retrieval algorithm in most vector databases, but it is rarely tuned or monitored in production RAG systems.
- Retrieval quality degrades silently as the vector database grows, even when latency remains stable.
- For the same corpus size, Flat search consistently achieves higher Recall@k than HNSW, serving as a useful upper bound for evaluation.
- HNSW recall degrades faster than Flat search for fixed ef_search values as database size increases.
- Increasing ef_search improves recall, but latency grows rapidly, creating a sharp recall–latency trade-off.
- Simply tuning HNSW parameters is insufficient at scale—vector search itself becomes noisy in dense embedding spaces.
- Hybrid RAG pipelines using metadata filters (SQL, graphs, inverted indexes) are the most reliable way to maintain recall at scale.
Conclusion
HNSW has earned its place as the backbone of modern vector databases—not because it is perfectly accurate, but because it is fast enough to make large-scale semantic search practical.
However, in RAG systems, speed without recall is a false optimization.
This article shows that as vector databases grow, retrieval quality deteriorates quietly—especially under approximate search—while latency metrics remain deceptively stable. The result is a system that appears healthy from an infrastructure perspective, but gradually feeds weaker context to the LLM, increasing hallucinations and reducing answer quality.
The solution is not to abandon HNSW, nor to arbitrarily increase ef_search.
Instead, production-grade RAG systems must:
- Measure retrieval quality explicitly and regularly.
- Treat Flat search as a recall baseline.
- Continuously rebalance recall and latency.
- And ultimately, move toward hybrid retrieval architectures that narrow the search space before vector similarity is applied.
If your RAG system’s answers are getting worse as your data grows, the problem may not be your LLM, your prompts, or your embeddings—but the retrieval algorithm you never realized you were relying on.
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
Images used in this article are synthetically generated. The LAION-Aesthetics dataset is used under the CC-BY 4.0 license. Figures and code created by me.