...

When (Not) to Use a Vector DB


Vector databases solve a real problem, and in many cases, they are the right choice for RAG systems. But here’s the thing: just because you’re using embeddings doesn’t mean you need a vector database.

We’ve seen a growing trend where every RAG implementation starts by plugging in a vector DB. That might make sense for large-scale, persistent knowledge bases, but it’s not always the most efficient path, especially when your use case is more dynamic or time-sensitive.

At Planck, we use embeddings to enhance LLM-based systems. But in one of our real-world applications, we opted to skip the vector database and use a simple key-value store instead, which turned out to be a much better fit.

Before I dive into that, let’s explore a simple, generalized version of our scenario to explain why.

Foo Example

Let’s imagine a simple RAG-style system. A user uploads a few text files, maybe some reports or meeting notes. We split those files into chunks, generate embeddings for each chunk, and use those embeddings to answer questions. The user asks a handful of questions over the next few minutes, then leaves. At that point, both the files and their embeddings are useless and can be safely discarded.

In other words, the data is ephemeral, the user will ask only a few questions, and we want to answer them as fast as possible.
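
To make that lifecycle concrete, here is a minimal sketch of the flow. Both embed_texts and answer_session are hypothetical placeholders for illustration, not code from a real system:

import numpy as np

def embed_texts(texts: list[str]) -> np.ndarray:
    # Placeholder: call whatever embedding model you use and return one
    # unit-normalized row per input text.
    raise NotImplementedError

def answer_session(chunks: list[str], questions: list[str], top_k: int = 5) -> None:
    # Embed once per upload; everything lives in memory for this session only.
    embeddings = embed_texts(chunks)
    for question in questions:
        query = embed_texts([question])[0]
        sims = embeddings @ query                 # plain similarity scan, no index
        best = sims.argsort()[-top_k:][::-1]      # indices of the top-k chunks
        context = [chunks[i] for i in best]
        # ...pass `context` and `question` to the LLM to generate an answer...
    # When the user leaves, chunks and embeddings are simply discarded.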

Now pause for a second and ask yourself:

Where should I store these embeddings?


Most people’s instinct is: “I have embeddings, so I need a vector database”, but pause for a second and think about what’s actually happening behind that abstraction. When you send embeddings to a vector DB, it doesn’t just “store” them. It builds an index that speeds up similarity searches. That indexing work is where a lot of the magic comes from, and also where a lot of the cost lives.

In a long-lived, large-scale knowledge base, this trade-off makes perfect sense: you pay an indexing cost once (or incrementally as data changes), and then spread that cost over millions of queries. In our Foo example, that’s not what’s happening. We are doing the opposite: constantly adding small, one-off batches of embeddings, answering a tiny number of queries per batch, and then throwing everything away.

So the real question is not “should I use a vector database?” but “is the indexing work worth it?” To answer that, we can look at a simple benchmark.

Benchmarking: No-Index Retrieval vs. Indexed Retrieval


This section is more technical. We’ll look at Python code and explain the underlying algorithms. If the exact implementation details aren’t relevant to you, feel free to skip ahead to the Results section.

We want to compare two systems:

  1. No indexing at all: we just keep the embeddings in memory and scan them directly on every query.
  2. A vector database, where we pay an indexing cost upfront to make each query faster.

First, consider the “no vector DB” approach. When a query comes in, we compute similarities between the query embedding and all stored embeddings, then select the top-k. That’s just K-Nearest Neighbors without any index.

import numpy as np

def run_knn(embeddings: np.ndarray, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Dot product equals cosine similarity when the vectors are unit-normalized.
    sims = embeddings @ query_embedding
    # Sort ascending, take the last top_k scores, reverse to get best-first order.
    return sims.argsort()[-top_k:][::-1]

The code uses the dot product as a proxy for cosine similarity (assuming normalized vectors) and sorts the scores to find the best matches. It literally just scans all vectors and picks the nearest ones.
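
If you want to convince yourself that the two measures agree for unit vectors, here is a quick sanity check (not part of the article’s benchmark):

import numpy as np

a = np.random.rand(1536).astype('float32')
b = np.random.rand(1536).astype('float32')
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# Cosine similarity divides by the two norms; for unit vectors both norms are 1,
# so it reduces to the plain dot product.
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(a @ b, cosine)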

Now, let’s look at what a vector DB typically does. Under the hood, most vector databases rely on an approximate nearest neighbor (ANN) index. ANN methods trade a bit of accuracy for a large boost in search speed, and one of the most widely used algorithms for this is HNSW. We’ll use the hnswlib library to simulate the index behavior.

import numpy as np
import hnswlib

def create_hnsw_index(embeddings: np.ndarray, num_dims: int) -> hnswlib.Index:
    # Build an HNSW graph over the embeddings using cosine distance.
    index = hnswlib.Index(space='cosine', dim=num_dims)
    index.init_index(max_elements=embeddings.shape[0])
    index.add_items(embeddings)  # this is the expensive indexing step
    return index

def query_hnsw(index: hnswlib.Index, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Approximate nearest-neighbor lookup; returns the labels of the top_k matches.
    labels, distances = index.knn_query(query_embedding, k=top_k)
    return labels[0]

To see where the trade-off lands, we can generate some random embeddings, normalize them, and measure how long each step takes:

import time
import numpy as np
import hnswlib
from tqdm import tqdm

def run_benchmark(num_embeddings: int, num_dims: int, top_k: int, num_iterations: int) -> None:
    print(f"Benchmarking with {num_embeddings} embeddings of dimension {num_dims}, retrieving top-{top_k} nearest neighbors.")

    knn_times: list[float] = []
    index_times: list[float] = []
    hnsw_query_times: list[float] = []

    for _ in tqdm(range(num_iterations), desc="Running benchmark"):
        # Fresh, unit-normalized random embeddings each iteration, simulating a new one-off batch.
        embeddings = np.random.rand(num_embeddings, num_dims).astype('float32')
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        query_embedding = np.random.rand(num_dims).astype('float32')
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # 1. Naive KNN: full scan, no index.
        start_time = time.time()
        run_knn(embeddings, query_embedding, top_k)
        knn_times.append((time.time() - start_time) * 1e3)

        # 2. HNSW: pay the indexing cost upfront...
        start_time = time.time()
        vector_db_index = create_hnsw_index(embeddings, num_dims)
        index_times.append((time.time() - start_time) * 1e3)

        # ...then each query is much cheaper.
        start_time = time.time()
        query_hnsw(vector_db_index, query_embedding, top_k)
        hnsw_query_times.append((time.time() - start_time) * 1e3)

    print(f"BENCHMARK RESULTS (averaged over {num_iterations} iterations)")
    print(f"[Naive KNN] Average search time without indexing: {np.mean(knn_times):.2f} ms")
    print(f"[HNSW Index] Average index construction time: {np.mean(index_times):.2f} ms")
    print(f"[HNSW Index] Average query time with indexing: {np.mean(hnsw_query_times):.2f} ms")

run_benchmark(num_embeddings=50000, num_dims=1536, top_k=5, num_iterations=20)

Results

In this example, we use 50,000 embeddings with 1,536 dimensions (matching OpenAI’s text-embedding-3-small) and retrieve the top-5 neighbors. The exact results will vary with different configs, but the pattern we care about is the same.

I encourage you to run the benchmark with your own numbers; it’s the best way to see how the trade-offs play out in your specific use case.

On average, the naive KNN search takes 24.54 milliseconds per query. Building the HNSW index for the same embeddings takes around 277 seconds. Once the index is built, each query takes about 0.47 milliseconds.

From this, we can estimate the break-even point. The difference between naive KNN and indexed queries is 24.07 ms per query. That implies you need 11,510 queries before the time saved on each query compensates for the time spent building the index.
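
The arithmetic behind that figure is just the one-time indexing cost divided by the time saved per query. A small helper (not part of the original benchmark) makes it explicit, using the averages measured above:

def break_even_queries(index_build_ms: float, knn_query_ms: float, hnsw_query_ms: float) -> float:
    # Number of queries after which the upfront indexing cost is paid back
    # by the per-query savings of the index.
    savings_per_query_ms = knn_query_ms - hnsw_query_ms
    return index_build_ms / savings_per_query_ms

# Using the averages from the run above (277 s build, 24.54 ms vs. 0.47 ms per query):
print(break_even_queries(index_build_ms=277_000, knn_query_ms=24.54, hnsw_query_ms=0.47))
# ≈ 11,500 queries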

Generated using the benchmark code: A graph comparing naive KNN and indexed search efficiency

Furthermore, even with different values for the number of embeddings and top-k, the break-even point remains in the thousands of queries and stays within a fairly narrow range. You don’t get a scenario where indexing starts to pay off after just a few dozen queries.

Generated using the benchmark code: A graph showing break-even points for various embedding counts and top-k settings (image by author)
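
If you want to reproduce a sweep like the one in that chart, you can loop run_benchmark over a few configurations. The grid below is illustrative rather than the exact one used for the figure:

# Illustrative sweep, not the exact grid behind the chart above.
for num_embeddings in (10_000, 50_000, 100_000):
    for top_k in (5, 20):
        run_benchmark(num_embeddings=num_embeddings, num_dims=1536, top_k=top_k, num_iterations=5)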

Now compare that to the Foo example. A user uploads a small set of files and asks a few questions, not thousands. The system never reaches the point where the index pays off. Instead, the indexing step simply delays the moment when the system can answer the first question and adds operational complexity.

For this sort of short-lived, per-user context, the simple in-memory KNN approach is not only easier to implement and operate, but it is also faster end-to-end.

If in-memory storage is not an option, either because the system is distributed or because we need to preserve the user’s state for a few minutes, we can use a key-value store like Redis. We can store a unique identifier for the user’s request as the key and store all the embeddings as the value.

This gives us a lightweight, low-complexity solution that’s well-suited to our use case of short-lived, low-query contexts.
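
As a rough sketch of that idea, using the redis-py client (the key scheme, serialization format, and TTL here are assumptions for illustration, not a description of our production code):

import numpy as np
import redis

r = redis.Redis()

def store_embeddings(request_id: str, embeddings: np.ndarray, ttl_seconds: int = 600) -> None:
    # One key per request; the whole embedding matrix is the value, with a short TTL
    # so the data expires on its own once the session is over.
    r.set(f"embeddings:{request_id}", embeddings.astype('float32').tobytes(), ex=ttl_seconds)
    r.set(f"embeddings:{request_id}:dims", embeddings.shape[1], ex=ttl_seconds)

def load_embeddings(request_id: str) -> np.ndarray:
    raw = r.get(f"embeddings:{request_id}")
    num_dims = int(r.get(f"embeddings:{request_id}:dims"))
    return np.frombuffer(raw, dtype='float32').reshape(-1, num_dims)

# Retrieval is then just the naive KNN from earlier:
# run_knn(load_embeddings(request_id), query_embedding, top_k=5)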

Real-World Example: Why We Chose a Key-Value Store


At Planck, we answer insurance-related questions about businesses. A typical request begins with a business name and address, and then we retrieve real-time data about that specific business, including its online presence, registrations, and other public records. This data becomes our context, and we use LLMs and algorithms to answer questions based on it.

The important bit is that every time we get a request, we generate a fresh context. We’re not reusing existing data; it’s fetched on demand and remains relevant for a few minutes at most.

If you think back to the earlier benchmark, this pattern should already be triggering your “this is not a vector DB use case” sensor.

Every time we receive a request, we generate fresh embeddings for short-lived data that we’ll likely query only a few hundred times. Indexing those embeddings in a vector DB adds unnecessary latency. In contrast, with Redis, we can immediately store the embeddings and run a quick similarity search in the application code with almost no indexing delay.

That’s why we chose Redis instead of a vector database. While vector DBs are excellent at handling large volumes of embeddings and supporting fast nearest-neighbor queries, they introduce indexing overhead, and in our case, that overhead is not worth it.

In Conclusion

If you need to store millions of embeddings and support high-query workloads across a shared corpus, a vector DB would be a better fit. And yes, there are definitely use cases out there that truly need and benefit from a vector DB.

But just because you’re using embeddings or building a RAG system doesn’t mean you should default to a vector DB.

Each database technology has its strengths and trade-offs. The best choice begins with a deep understanding of your data and use case, rather than mindlessly following the trend.

So, the next time you need to choose a database, pause for a moment and ask: am I choosing the right one based on objective trade-offs, or am I just going with the trendiest, shiniest choice?

