This post introduces the emerging field of semantic entity resolution for knowledge graphs, which uses language models to automate the most painful part of building knowledge graphs from text: deduplicating records. Knowledge graphs extracted from text power most autonomous agents, but these contain many duplicates. The work below includes original research, so this post is necessarily technical.
Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller blocks so that pairwise comparison, which scales quadratically at n² complexity, remains tractable), matching and even merging duplicate nodes and edges. In the past, entity resolution systems relied on statistical tricks such as string distance, static rules or complex ETL to schema align, block, match and merge records. Semantic entity resolution uses representation learning to gain a deeper understanding of records’ meaning in the domain of a business, automating the same process as part of a knowledge graph factory.
TLDR
The same technology that transformed textbooks, customer service and programming is coming for entity resolution. Skeptical? Try the interactive demos below… they show potential 🙂
Don’t Just Say It: Prove It
I don’t want to convince you; I want to convert you with interactive demos in each post. Try them, edit the data, see what they can do. Play with it. I hope these simple examples prove the potential of a semantic approach to entity resolution.
- This post has two demos. In the first demo we extract companies from news, plus Wikipedia for enrichment. In the second demo we deduplicate those companies in a single prompt using semantic matching.
- In a second post I’ll demonstrate semantic blocking, a term I define as meaning “using deep embeddings and semantic clustering to build smaller groups of records for pairwise comparison.”
- In a third post I’ll show how semantic blocking and matching combine to improve text-to-Cypher of a real knowledge graph in KuzuDB.
Agent-Based Knowledge Graph Explosion!
Why does semantic entity resolution matter at all? It’s about agents!
Autonomous agents are hungry for knowledge, and recent models like Gemini 2.5 Pro make extracting knowledge graphs from text easy. LLMs are so good at extracting structured information from text that there will be more knowledge graphs built from unstructured data in the next eighteen months than have ever existed before. The source of most web traffic is already hungry LLMs consuming text to produce structured information. Autonomous agents are increasingly powered by text-to-query over a graph database via tools like Text2Cypher.
The semantic web turned out to be highly individualistic: every company of any size is about to have their own knowledge graph of their problem domain as a core asset to power the agents that automate their business.
Subplot: Powerful Agents Need Entity Resolved KGs
Companies building agents are about to run straight into entity resolution for knowledge graphs as a complex, often cost-prohibitive problem preventing them from harnessing their organizational knowledge. Extracting knowledge graphs from text with LLMs produces large numbers of duplicate nodes and edges. Garbage in: garbage out. When concepts are split across multiple entities, wrong answers emerge. This limits raw, extracted graphs’ ability to power agents. Entity resolved knowledge graphs are required for agents to do their jobs.
Entity Resolution for Knowledge Graphs
There are several steps to entity resolution for knowledge graphs to go from raw data to retrievable knowledge. Let’s define them to understand how semantic entity resolution improves the process.
Node Deduplication
- A low cost blocking function groups similar nodes into smaller blocks (groups) for pairwise comparison, because comparing all pairs scales at n² complexity.
- A matching function makes a match decision for each pair of nodes within each block, often with a confidence score and an explanation.
- New SAME_AS edges are created between each matched pair of nodes.
- This forms clusters of connected nodes called connected components. One component corresponds to one resolved record.
- Nodes in components are merged — fields may become lists, which are then deduplicated. Merging nodes can be automated with LLMs.
The diagram below illustrates this process:
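The steps above can also be sketched in code. Below is a minimal, stdlib-only Python sketch of the node deduplication loop, with a toy blocking key (first letter of the name) and a toy matcher (prefix test) standing in for semantic clustering and an LLM matcher; the union-find pass turns matched SAME_AS pairs into connected components, one per resolved entity:

```python
from itertools import combinations

def block(records, key):
    """Group record indices into blocks using a cheap blocking key."""
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(i)
    return list(blocks.values())

def connected_components(n, same_as_edges):
    """Union-find over SAME_AS edges; each component is one resolved entity."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in same_as_edges:
        parent[find(a)] = find(b)  # union the two components
    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)
    return list(components.values())

def resolve(records, key, match):
    """Block, match pairs within each block, then cluster matches into entities."""
    edges = []
    for blk in block(records, key):
        for a, b in combinations(blk, 2):  # pairwise comparison only inside a block
            if match(records[a], records[b]):
                edges.append((a, b))
    return connected_components(len(records), edges)
```

In a real pipeline, the final merge step then collapses each component into a single record, which is where an LLM can take over.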
Edge Deduplication
Merged nodes combine the edges of their source nodes, which often include duplicate edges of the same type that must themselves be combined. Blocking for edges is simpler, but merging can be complex depending on edge properties.
- Edges are GROUPED BY their source node id, destination node id and edge type to create edge blocks.
- An edge matching function makes a match decision for each pair of edges within an edge block.
- Edges are then merged using rules for how to combine properties like weights.
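The three steps above fit in a few lines of Python. The edge schema (src, dst, type, weight) and the max rule for combining weights are illustrative assumptions, not a prescribed format:

```python
def merge_edges(edges, combine=max):
    """Deduplicate edges: GROUP BY (source, destination, type), then merge properties."""
    grouped = {}
    for e in edges:
        k = (e["src"], e["dst"], e["type"])  # the edge blocking key
        grouped.setdefault(k, []).append(e)
    merged = []
    for (src, dst, etype), dupes in grouped.items():
        merged.append({
            "src": src, "dst": dst, "type": etype,
            # rule for combining edge properties like weights; max is one choice,
            # sum or mean may fit other workloads
            "weight": combine(e.get("weight", 1.0) for e in dupes),
        })
    return merged
```

For edges with richer properties, the merge rule is where the complexity lives; an LLM can also perform this step when rules are hard to write.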
The resulting entity resolved knowledge graph now accurately represents knowledge of the problem domain. Text2Cypher over this knowledge base becomes a powerful way to drive autonomous agents… but not before entity resolution occurs.
Where Existing Tools Come up Short
Entity resolution for knowledge graphs is a difficult problem, so existing ER tools for knowledge graphs are complex. Most entity linking libraries from academia aren’t effective in real-world scenarios. Commercial entity resolution products are stuck in a SQL-centric world, are often limited to people and company records, and can be prohibitively expensive, especially for large knowledge graphs. Both sets of tools match but don’t merge nodes and edges for you, which requires a lot of manual effort through complex ETL. There is an acute need for the simpler, automated workflow semantic entity resolution represents.
Semantic Entity Resolution for Graphs
Modern semantic entity resolution schema aligns, blocks, matches and merges records using pre-trained language models: deep embeddings, semantic clustering and generative AI. It can group, match and merge records in an automated process, using the same transformers that are replacing so many legacy systems because they comprehend the actual meaning of data in the context of a business or problem domain.
Semantic ER isn’t new: it has been state-of-the-art since Ditto used BERT to both block and match in the landmark 2020 paper Deep Entity Matching with Pre-Trained Language Models (Li et al, 2020), beating previous benchmarks by as much as 29%. We used Ditto and BERT to do entity resolution for billions of nodes at Deep Discovery in 2021. Both Google and Amazon have semantic ER offerings… what is new is its simplicity, making it more accessible to developers. Semantic blocking still uses sentence transformers, now with today’s powerful embeddings. Matching has transitioned from custom transformer models to large language models. Merging with language models emerged just this year. It continues to evolve.
Semantic Blocking: Clustering Embedded Records
Semantic blocking uses the same sentence transformer models powering today’s Retrieval Augmented Generation (RAG) systems to convert records into dense vector representations for semantic retrieval via vector similarity measures like cosine similarity. It then applies semantic clustering to the fixed-length vectors produced by sentence encoder models (i.e. SBERT) to group records likely to match, based on their semantic similarity in terms of the data’s problem domain.
Semantic clustering is an efficient method of blocking that yields smaller blocks with more positive matches. Unlike traditional syntactic blocking methods, which employ string similarity measures to form blocking keys that group records, semantic clustering leverages the rich contextual understanding of modern language models to capture deeper relationships between the fields of records, even when their strings differ dramatically.
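As a sketch of the idea, the pure-Python leader clustering below groups records whose embedding vectors exceed a cosine similarity threshold. The toy 2-D vectors stand in for real sentence-transformer embeddings (e.g. SBERT), and the greedy single-pass clustering stands in for more robust methods like HDBSCAN or approximate nearest neighbor search:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_blocks(vectors, threshold=0.9):
    """Greedy leader clustering: each record joins the first block whose
    leader vector it resembles, or starts a new block."""
    leaders, blocks = [], []
    for i, v in enumerate(vectors):
        for j, leader in enumerate(leaders):
            if cosine(v, leader) >= threshold:
                blocks[j].append(i)
                break
        else:
            leaders.append(v)
            blocks.append([i])
    return blocks
```

Only records inside a block are compared pairwise, so the n² cost applies per block rather than to the whole dataset.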
You can see semantic clusters emerge in this vector similarity matrix of semantic representations below: they’re the blocks along the diagonals… and they can be beautiful 🙂
While off-the-shelf, pre-trained embeddings can work well, semantic blocking can be greatly enhanced by fine-tuning sentence transformers for entity resolution. I’ve been working on exactly that using contrastive learning for people and company names in a project called Eridu (huggingface). It’s a work in progress, but my prototype address matching model works surprisingly well using synthetic data from GPT4o. You can fine-tune embeddings to both cluster and match.
I’ll demonstrate the specifics of semantic blocking in my second post. Stay tuned!
Align, Match and Merge Records with LLMs
Prompting Large Language Models to both match and merge two or more records is a new and powerful technique. The latest generation of Large Language Models is surprisingly powerful for matching JSON records, which shouldn’t be surprising given how well they can perform information extraction. My initial experiment used BAML to match and merge company records in a single step and worked surprisingly well. Given the rapid pace of improvement in LLMs, it isn’t hard to see that this is the future of entity resolution.
Can an LLM be trusted to perform entity resolution? This should be judged on merit, not preconception. It is strange to think that LLMs can be trusted to build knowledge graphs whole-cloth, but can’t be trusted to deduplicate their entities! Chain-of-Thought can be employed to produce an explanation for each match. I discuss workloads below, but as the diversity of knowledge graphs expands to cover every business and its agents, there will be a strong demand for simple ER solutions extending the KG construction pipeline using the same tools that make it up: BAML, DSPy and LLMs.
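A sketch of how such a match-and-merge prompt might be assembled appears below. The instruction wording and schema hint are hypothetical, not the actual Prompt Fiddle prompt; in practice BAML renders something similar from its Jinja2 templates and field descriptions:

```python
import json

def match_merge_prompt(records, schema_hint):
    """Assemble a single prompt asking the LLM to match and merge all records at once."""
    lines = [
        "You are an entity resolution system.",
        "Group the records below that refer to the same real-world entity,",
        "then emit one merged record per group, following the field descriptions.",
        f"Schema guidance: {schema_hint}",
        "Records:",
    ]
    for i, rec in enumerate(records):
        # index each record so the model can cite which inputs it merged
        lines.append(f"[{i}] {json.dumps(rec)}")
    return "\n".join(lines)
```

Feeding many records in one prompt is what makes the multi-match, multi-merge approach in the second demo possible.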
Low-Code Proof-of-Concept
There are two interactive Prompt Fiddle demos below. The entities extracted from the first demo are used as records to be entity resolved in the second.
Extracting Companies from News and Wikipedia
The first demo is an interactive demo showing how to perform information extraction from news and Wikipedia using BAML and Gemini 2.5 Pro. BAML models are based on Jinja2 templates and define what semi-structured data is extracted from a given prompt. They can be exported as Pydantic models via the baml-cli generate command. The following demo extracts companies from the Wikipedia article on Nvidia.
Click for live demo: Interactive demo of information extraction of companies using BAML + Gemini – Prompt Fiddle
I’ve been doing the above for the past three months for my investment club and… I’ve hardly found a single mistake. Any time I’ve thought a company was erroneous, it was actually a good idea to include it: Meta when Llama models were mentioned. By comparison, state-of-the-art, traditional information extraction tools… don’t work very well. Gemini is far ahead of other models when it comes to information extraction… provided you use the right tool.
BAML and DSPy feel like disruptive technologies. They provide enough accuracy that LLMs become practical for many tasks. They are to LLMs what Ruby on Rails was to web development: they make using LLMs joyous. So much fun! An introduction to BAML is here and you can also check out Ben Lorica’s show about BAML.
A truncated version of the company model appears below. It has 10 fields, most of which won’t be extracted from any one article… so I threw in Wikipedia, which gets most of them. The question mark after properties like exchange string? means optional, which is important because BAML won’t extract an entity missing a required field. @description gives guidance to the LLM in interpreting the field for both extraction and matching and merging.
Semantic ER Accelerates Enrichment
Once entity resolution is automated, it becomes trivial to flesh out any public-facing entity using the wikipedia PyPI package (or a commercial API like Diffbot or Google Knowledge Graph), so in the examples I included Wikipedia articles for some companies, along with a pair of articles about NVIDIA and AMD. Enriching public-facing entities from Wikipedia was always on the TODO list when building a knowledge graph, but so often up to now it didn’t get done due to the overhead of schema alignment, entity resolution and merging records. For this post, I added it in minutes. This convinced me there will be a lot of downstream impact from the rapidity of semantic ER.
Semantic Multi-Match-Merge with BAML, Gemini 2.5 Pro
The second demo below performs entity matching on the Company entities extracted during the first demo, along with several more company Wikipedia articles. It merges all 39 records at once without a single mistake! Talk about potential!? It is not a fast prompt… but you don’t actually need Gemini 2.5 Pro to do it; faster models will work, and LLMs can merge many more records than this at once in a 1M token window… and rising fast 🙂
Click for live demo: LLM MultiMatch + MultiMerge – Prompt Fiddle
Merging Guided by Field Descriptions
If you look, you’ll find that the merge of companies above automatically chooses the full company name when multiple forms are present, owing to the Company.name field description: Formal name of the company with corporate suffix. I didn’t have to give that instruction in the prompt! It is possible to use record metadata to guide schema alignment, matching and merging without directly editing a prompt. Along with merging multiple records in an LLM, I believe this is original work… that I stumbled into 🙂
The field annotation in the BAML schema:
class Company {
name string
@description("Formal name of the company with corporate suffix")
...
}
The original two records, one extracted from news, the other from Wikipedia:
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it is led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
"employees": null,
"founded_year": 1993,
"ceo": "Jensen Huang",
"linkedin_url": "null"
}
{
"name": "Nvidia",
"ticker": null,
"description": "A company specializing in GPUs and full-stack AI computing platforms, including the GB200 and Blackwell series, and platforms like DGX Cloud.",
"website_url": "null",
"headquarters_location": "null",
"revenue_usd": null,
"employees": null,
"founded_year": null,
"ceo": "null",
"linkedin_url": "null"
}
The matched and merged record appears below. Note that the longer Nvidia Corporation was selected without specific guidance, based on the field description. Also, the description is a summary of both the Nvidia mention in the article and the Wikipedia entry. And no, the schemas don’t have to be the same 🙂
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it is led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
"employees": null,
"founded_year": 1993,
"ceo": "Jensen Huang",
"linkedin_url": "null"
}
Below is the prompt, all pretty and branded for a slide:
Now to be clear: there’s a lot more than matching in a production entity resolution system… you need to assign unique identifiers to new records and include the merged IDs as a field, to keep track of which records were merged… at a minimum. I do this in my investment club’s pipeline. My goal is to show you the potential of semantic matching and merging using large language models… if you’d like to take it further, I can help. We do that at Graphlet AI 🙂
Schema Alignment? Coming Up!
Another tough problem in entity resolution is schema alignment: different sources of data for the same type of entity have fields that don’t exactly match. Schema alignment is a painful process that normally occurs before entity resolution is possible. With semantic matching and similar field names or descriptions, schema alignment just happens: the records being matched and merged align through the power of representation learning, which understands that the underlying concepts are the same.
Beyond Matching
An interesting aspect of doing multiple record comparisons at once is that it provides an opportunity for the language model to observe, evaluate and comment on the group of records in the prompt. In my own entity resolution pipeline, I combine and summarize multiple descriptions of companies in Company objects, extracted from different news articles, each of which summarizes the company as it appears in that particular article. This provides a comprehensive description of a company in terms of its relationships not otherwise available.
I believe there are many opportunities like this, given that even last year’s LLMs can do linear and non-linear regression… check out From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples (Vacareanu et al, 2024).
There is no end to the observations an LLM might make about groups of records: tasks related to entity resolution, but not limited to it.
Cost and Scalability
The early, high cost of large language model APIs and the historical high price of GPU inference have created skepticism about whether semantic entity resolution can scale.
Scaling Blocking via Semantic Clustering
Matching in entity resolution for knowledge graphs is just link prediction of SAME_AS edges, a common graph machine learning task. There is little question that semantic clustering for link prediction can cost-efficiently scale, as the technique was proven at Google by Grale (Halcrow et al, 2020, NeurIPS presentation). That paper’s authors include graph learning luminary Bryan Perozzi, recent winner of KDD’s Test of Time Award for his invention of graph embeddings.
Semantic clustering in Grale is a crucial part of the machine learning behind many features across Google’s web properties, including recommendations at YouTube. Note that Google also uses language models to match nodes during link prediction in Grale 🙂 Google also uses semantic clustering in its Entity Reconciliation API for its Enterprise Knowledge Graph service.
Clustering in Grale uses Locality Sensitive Hashing (LSH). Another efficient method of clustering via information retrieval is to use L2 / Approximate K-Nearest Neighbors clustering in a vector database such as Facebook FAISS (blog post) or Milvus. In FAISS, records are clustered during indexing and may be retrieved as groups of similar records via A-KNN.
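As a sketch of the LSH idea, the random-hyperplane variant below hashes each vector to a signature of sign bits; vectors landing in the same bucket become candidates for pairwise matching. The plane count and toy vectors are illustrative assumptions, not Grale's actual scheme:

```python
import random

def lsh_buckets(vectors, n_planes=8, seed=42):
    """Random-hyperplane LSH: records with the same sign pattern across all
    hyperplanes land in the same bucket, approximating cosine-similar groups."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        # signature: which side of each random hyperplane the vector falls on
        sig = tuple(sum(a * b for a, b in zip(p, v)) >= 0 for p in planes)
        buckets.setdefault(sig, []).append(i)
    return list(buckets.values())
```

More planes produce smaller, purer buckets at the cost of splitting some true matches; production systems tune this trade-off or use multiple hash tables.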
I’ll talk more about scaling semantic blocking in my second post!
Scaling Matching via Large Language Models
Large Language Models are resource intensive and employ GPUs for efficiency in both training and inference. There are three reasons to be optimistic about their efficiency for entity resolution.
1. LLMs are constantly, rapidly becoming less expensive… don’t match your budget today? Wait a month. They are also becoming more capable. Not accurate enough today? Wait a week for the new best model. Given time, your satisfaction is inevitable.
The economics of matching via an LLM were first explored in Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution (Nananukul et al, 2023). The authors include Mayank Kejriwal, who wrote the bible of KGs. They achieved surprisingly accurate results, given how dated GPT-3.5 now looks.
2. Semantic blocking can be more effective, meaning smaller blocks with more positive matches. I’ll demonstrate this process in my next post.
3. Multiple records, even multiple blocks, can be matched simultaneously in a single prompt, given that modern LLMs have 1 million token context windows. 39 records match and merge at once in the demo above, but ultimately, thousands will at once.
Skepticism: A Tale of Two Workloads
Some workloads are appropriate for semantic entity resolution today, while others are not yet. Let’s explore what works today and what doesn’t.
Semantic entity resolution is best suited for knowledge graphs that have been extracted from unstructured text using a large language model — which you already trust to generate the data. You also trust embeddings to retrieve the data. Why wouldn’t you trust embeddings to block your data into matching groups, followed by an LLM to match and merge records?
Modern LLMs and tools like BAML are so powerful for information extraction from text that the next two years will see a proliferation of knowledge graphs covering everything from traditional domains like science, e-commerce, marketing, finance, manufacturing and biomedicine to… anything and everything: sports, fashion, cosmetics, hip-hop, crafts, entertainment, non-fiction (every book gets a KG), even fiction (I predict a massive Cthulhu Mythos KG… which I may now build). These kinds of workloads will skip traditional entity resolution tools entirely and perform semantic entity resolution as another step in their KG construction pipelines.
Idempotence for Entity Resolution
Semantic entity resolution isn’t ready for finance and medicine, both of which have strict idempotence (reproducibility) as a legal requirement. This has led to scare tactics that pretend this applies to all workloads.
LLM output varies for several reasons, one being that GPUs execute multiple threads concurrently that finish in varying orders. There are hardware and software settings that reduce or remove variation to improve consistency, at a performance cost, but it isn’t clear these remove all variation even on the same hardware. Strict idempotence is only possible when hosting large language models on the same hardware between runs, with a variety of hardware and software settings applied, at a performance penalty… and it requires a proof-of-concept. That is likely to change via specific hardware designed for financial institutions as LLMs take over the rest of the world. Regulations are also likely to change over time to accommodate statistical precision rather than precise determinism.
For explanations of matching and merging records, idempotent workloads must also address the fact that Reasoning Models Don’t Always Say What They Think (Chen et al, 2025). See also, more recently, Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (Zhao et al, 2025). Addressing this requires sufficient validation, using emerging tools like prompt tuning, to achieve accurate, fully reproducible behavior.
Data Provenance
If you use semantic methods to block, match and merge for existing entity resolution workloads, you must still track the reason for a match and maintain data provenance: a complete lineage of records. This is hard work! That means that most businesses will choose a tool that leverages language models, rather than doing their own entity resolution. Keep in mind that most knowledge graphs two years from now will be new knowledge graphs built by large language models in other domains.
Abzu Capital
I’m not a vendor selling you a product… I strongly believe in open source, open data tools. I’m in an investment club that built an entity resolved knowledge graph of AI, robotics and data-center related industries using this technology. We wanted to invest in smaller technology companies with high growth potential that cut deals and form strategic relationships with bigger players with large capital expenditures… but reading form 10-K reports, tracking the news and adding up the deals for even a handful of investments became a full time job. So we built agents powered by a knowledge graph of companies, technologies and products to automate the process! This is the place from which this post comes.
Conclusion
In this post, we explored semantic entity resolution. We demonstrated proof-of-concept information extraction and entity matching using Large Language Models (LLMs). I encourage you to play with the provided demos and come to your own conclusions about semantic entity matching. I think the simple result above, combined with the other two posts, will show early adopters this is the way the market will turn, one workload at a time.
Up Next…
This is the first post in a series of three posts. In the second post, I will demonstrate semantic blocking by semantic clustering of sentence encoded records. In my final post, I’ll provide an end-to-end example of semantic entity resolution to improve text-to-Cypher on a real knowledge graph for a real-world use case. Stick around, I think you’ll be pleased 🙂
At Graphlet AI we build autonomous agents powered by entity resolved knowledge graphs for companies large and small. We build large knowledge graphs from structured and unstructured data: millions, billions or trillions of nodes and edges. I lead the Spark GraphFrames project, widely used in entity resolution for connected components. I have a 20-year background and teach network science, graph machine learning and NLP. I built and product managed LinkedIn InMaps and Career Explorer. I was a visualization engineer at Ning (Marc Andreessen’s social network), evangelist at Hortonworks and Principal Data Scientist at Walmart. I coined the term “agile data science” in 2009 (from zero hits on Google) and wrote the first agile data science methodology in Agile Data Science (O’Reilly Media, 2013). I improved it in Agile Data Science 2.0 (O’Reilly Media, 2017), which has a 4-star rating on Amazon 8 years later (the code still works). I wrote the first fully data-driven market report for O’Reilly Media in 2015. I’m an Apache Committer on DataFu, I wrote the Apache Druid onboarding docs, and I maintain the graph sampler Little Ball of Fur and the graph embedding collection Karate Club.
This post originally appeared on the Graphlet AI Blog.