How To Build a Graph-Based Recommendation Engine Using EDG and Neo4j

In this tutorial, I’ll show you how to manage a taxonomy in EDG and publish it to a Neo4j instance, where it can be populated with additional data to power a recommendation engine. The taxonomy, which is built and maintained in TopQuadrant’s EDG, defines the structure. A set of (fake) academic journal articles serves as the instance data that populates Neo4j. I’ll use a small hierarchy of STEM categories as the taxonomy to organize the articles. This data is covered under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

Note 1: Full disclosure — I work at TopQuadrant, the company that makes EDG, so I’m naturally biased toward the tools I know well. Both Neo4j and TopQuadrant’s EDG are commercial products and not open source. They each offer free trial versions suitable for following along with this tutorial: Neo4j provides one free cloud database instance (with limits on data volume, memory, and CPU), and TopQuadrant offers a 90-day free trial of EDG Desktop. Also, while the architecture outlined here has its benefits, it’s not the only approach, and these aren’t the only vendors capable of supporting this type of workflow. The pros and cons of this approach are listed below.

Note 2: Here is a video recording of what this demo looks like.

Note 3: All images in this post were created by the author.

What’s the point of all of this? The point is that a lot of meaning lives in the taxonomy itself. Each article is tagged with the most specific category that applies, but because the taxonomy encodes parent–child relationships, we can infer higher-level associations automatically. For example, if an article is tagged with Mathematical Software, it’s also about Computer Science and STEM, even if it isn’t explicitly tagged that way. The taxonomy doesn’t just classify; it enables reasoning over how topics relate. The data source only needs to record the most relevant tag, and the hierarchy fills in the rest.
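Here is a minimal Python sketch of that inference. It is an illustration only, not how EDG or Neo4j implement it: the `BROADER` dictionary is a hand-written stand-in for the taxonomy, and placing Computer Science directly under STEM is an assumption made for the example.

```python
# Hypothetical child -> parent map standing in for the taxonomy's broader links.
BROADER = {
    "Mathematical Software": "Computer Science",
    "Computer Science": "STEM",
}

def inferred_topics(tag, broader):
    """Return the tag followed by every ancestor reachable via broader links."""
    topics = [tag]
    seen = {tag}
    while topics[-1] in broader:
        parent = broader[topics[-1]]
        if parent in seen:  # guard against an accidental cycle
            break
        topics.append(parent)
        seen.add(parent)
    return topics

print(inferred_topics("Mathematical Software", BROADER))
```

The article record stores a single tag; everything else is derived by walking up the hierarchy at query time.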

In other words, we are separating the instance-level information (what an individual article is about) from the meta-level information about the topics themselves and how they relate to each other.

The reasons you’d want to build with this kind of architecture are:

Inferencing: Tag with one concept but use the taxonomy to associate many other concepts to the content. Instead of tagging an article with Mathematical Software and Computer Science, I can just tag it with Mathematical Software. The taxonomy knows that Mathematical Software is a branch of Computer Science. The parent concept, Computer Science, can be inferred based on the taxonomy.

Aligning multiple systems: I can use one taxonomy to build a recommendation engine in Neo4j and a GraphRAG application in GraphDB. One team can use vector-based tagging on content stored in SharePoint while another uses NLP rule-based tagging on content stored in Adobe Experience Manager (AEM). All of these apps are aligned because they’re all using the same reference data.

Change management: If I want to recategorize Mathematical Software as a branch of Mathematics rather than a branch of Computer Science, I just need to change its parent in the taxonomy. If I don’t have a separate taxonomy, I’d need to retag every document tagged with Mathematical Software. If I have multiple downstream apps using the same list of terms, this becomes a nightmare. I’d need to retag every entity tagged with Mathematical Software in every application and ensure all the other tags associated with that document are correct.

Play to tools’ strengths: EDG is great at managing metadata and taxonomies and ensuring they are aligned and well governed. Neo4j and other graph databases are great at high-performance graph analytics at scale but struggle with the metadata management side of things. With this setup, we can get the best of both worlds.

There are other architectural approaches to building something like this, of course, and there are drawbacks to the approach I outline here. Some of the main ones include:

Overkill for simple use cases: This tutorial uses a simple demo, but the architecture makes the most sense when your data and use cases are complex. Most graph databases, including Neo4j, let you define a schema or basic ontology and represent taxonomies with hierarchical relationships. If your data is relatively simple, your taxonomy is straightforward, or only one team needs to use it, you may not need this many tools.

Skillset and learning curve: Using EDG and Neo4j together assumes familiarity with two different paradigms: ontology modeling in RDF/SHACL and graph querying in property graphs/Cypher. Many teams are comfortable with one but not the other.

More moving parts: Keeping a taxonomy separate from the data you are tagging means you need to ensure that the tags align with the taxonomy. If they drift, the graph stops fitting together cleanly in the database.

Vendor lock-in: Both Neo4j and EDG are commercial products, so there is always going to be some lock-in and potential migration costs. The standards underlying EDG (RDF, SHACL, and SPARQL) are open standards from the W3C, which does mitigate the overall technical lock-in.
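The "more moving parts" risk above can be reduced with a simple drift check. The sketch below is one hypothetical way to do it in Python: the URIs, the record layout, and the `find_drifted_tags` helper are all assumptions for illustration, not part of EDG or Neo4j.

```python
# Hypothetical URIs; the real ones would come from the taxonomy managed in EDG.
taxonomy_uris = {
    "http://example.org/stem#MathematicalSoftware",
    "http://example.org/stem#ComputersAndSociety",
}

# Hypothetical article records, each carrying the URI of the topic it was tagged with.
articles = [
    {"id": "1", "topicUri": "http://example.org/stem#MathematicalSoftware"},
    {"id": "2", "topicUri": "http://example.org/stem#QuantumBasketWeaving"},
]

def find_drifted_tags(articles, taxonomy_uris):
    """Return ids of articles whose tag does not exist in the taxonomy."""
    return [a["id"] for a in articles if a["topicUri"] not in taxonomy_uris]

print(find_drifted_tags(articles, taxonomy_uris))
```

Running a check like this before loading instance data would surface tags that no longer match any concept, so they can be fixed at the source.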

Neo4j is a labeled property graph (LPG) database. EDG is a knowledge graph curation tool based on RDF and SHACL. LPGs and RDF are two different graph technologies that, historically, have not been compatible. EDG recently added a Neo4j integration feature, however, which allows users to build with both technologies.

Below is a visual representation of how these two technologies can work together.

At the bottom in pink, you have data storage, which I have split into internal data and external data. Internal data is the raw data you could be storing in a data lake, a content management system (CMS) like SharePoint, or a relational database. There may also be external datasets you want to integrate into your app. These could be public, free data sources like Wikidata, upper-level ontologies like gist, or proprietary reference datasets like SNOMED or MedDRA (medical taxonomies).

EDG can then act as the semantic layer between the underlying data and downstream apps. You can manage your ontologies, taxonomies, reference data, and metadata in one place and push what you need to applications like Neo4j. You can also load data directly from your underlying data sources into Neo4j or any other application.

Step 1: Get free versions of EDG and Neo4j

First, we are going to need to get free versions of these products to play around with.

For EDG, you’ll need to go to this website and request a free trial. You’ll get a link to download EDG along with a license in an email. After the download completes, there is an executable file in the edg folder, also called edg. Double click that and it should start running in your browser. If you don’t have Java installed, it will prompt you to install Java first.

EDG will then open in a new browser tab at an address like http://localhost:8083/, but it will say it is not registered. Click on Product Registration and then upload the license file that was also sent in the email. Then click “Register Product”.

After uploading the license, you can go back to the home screen by clicking the TopQuadrant logo in the top left corner. Now you should be able to see the main EDG landing page.

Now we need a free version of Neo4j. Go to this link to get started with your free trial. If you don’t have an account already, you will need to make one. After you create a Neo4j account you will land on a screen like this:

Click “Create instance” and then select the free option.

When you click “Create instance” you will be shown your username and password. The username is usually just “neo4j” but the password is unique, so write it down somewhere.

Step 2: Set up integration

In EDG, in the top right corner, click on the user icon (it looks like a person). Then click “Server Administration”. This will take you to a screen with a bunch of options. Click “Product Configuration Parameters”. On the left toolbar you will see a bunch of integration options. Click “Neo4j”.

You can configure this to push to multiple Neo4j databases, but for this tutorial we will just point to the Neo4j instance we just created. On the right side of the empty Neo4j database line there is a plus sign. Click that and you will be prompted to enter the Neo4j credentials.

You can name this configuration anything but I chose “neo4jtest1”. The ID should be autofilled by EDG. For the Neo4j database URL, you will need to inspect the Neo4j instance you created in Neo4j. It will look something like this: neo4j+s://cd227570.databases.neo4j.io.
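A common slip here is pasting the browser address of the Aura console instead of the connection URI. The sketch below is an optional sanity check in Python; `check_neo4j_url` is a hypothetical helper, and the scheme list reflects the URI schemes Neo4j drivers accept, not anything specific to EDG.

```python
from urllib.parse import urlparse

# URI schemes accepted by Neo4j drivers; Aura instances use the encrypted neo4j+s form.
KNOWN_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}

def check_neo4j_url(url):
    """Return True if the URL has a Neo4j scheme and a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in KNOWN_SCHEMES and bool(parsed.hostname)

print(check_neo4j_url("neo4j+s://cd227570.databases.neo4j.io"))
print(check_neo4j_url("https://cd227570.databases.neo4j.io"))
```

This only checks the shape of the URL; it does not contact the database or validate credentials.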

Click “Create and Select”. Now you will need to enter your password. This is the one that Neo4j gave you when you created your Neo4j instance.

Now we are all configured.

Step 3: Import taxonomy

Go to my GitHub and download this taxonomy. This is a list of STEM topics arranged in a hierarchy, i.e., a taxonomy.

Click “New +” at the top of the screen in EDG then “Import asset collections from TriG or Zip file”. Choose the zip file you got from my GitHub and load it into EDG. Click Finish. When you go to the taxonomy you should see a hierarchical list of a bunch of different STEM categories.

Step 4: Push taxonomy to Neo4j

Click the cloud dropdown to manage integrations. In the dropdown menu you will see the option to “Link to Neo4j Database”.

When you click this you will be able to choose which Neo4j integration you want to use. Click the one you created in step 2 above.

After you select the Neo4j integration, the integration between this taxonomy and your Neo4j instance will be created. It will look like the popup below. Click the integration to navigate to it. In my example below it is called “Integration with Neo4j database neo4jtest1”. Then click “Ok”.

The integration will now appear in the editor and we can change any settings if we want. You’ll notice that next to the cloud dropdown there is an icon for pushing to integrated systems that looks like a cloud with an arrow on it.

Click edit and then scroll down to “included classes”. This is where we specify which classes in our taxonomy we want to push to this Neo4j instance. For this tutorial, select “Concept”. This should include everything in the taxonomy. This may seem unnecessary, but it is important for large taxonomies with many kinds of classes.

Also set “always overwrite” to “True”. This ensures that when we push, we overwrite whatever is in the Neo4j instance.

Now click “Save Changes”.

Back in the editor interface, click the cloud push icon that is in the top toolbar now that we have established a Neo4j integration. A popup should appear that looks like the image below. If we have multiple integrations configured with multiple different applications, we’d see them all here. For this tutorial, you should just see the one you made and it should be automatically selected. Now click “Ok”.

You should see a progress bar of your concepts getting pushed to Neo4j.

Step 5: Explore data in Neo4j

Now go back to your Neo4j Aura instance. If you click Instances on the left toolbar you will see the instance we created in Step 1. Now you will see that there are Nodes and Relationships in it!

You can click “Connect” and then “Explore” which will take you to a visual representation of your graph.

Below is the visual explorer of Neo4j Aura. You can just search on the generic term “Resource – BROADER – Resource” to see all of the concepts we pushed from EDG along with their parent concepts.

Step 6: Upload articles to Neo4j

Download a list of journal articles from my GitHub here. This is a short list of fake academic journal articles. The idea here is that we want the taxonomy to come from EDG but the article metadata to come from somewhere else.

Now in Neo4j, click “Import” on the left toolbar and “New data source”. A list of options will appear. You could import your instance data from anywhere, but for this tutorial we will just upload the CSV file directly. The source of the data doesn’t matter; what matters is that the instance data is tagged with terms that come from the taxonomy we are managing in EDG. That is how we can align the article metadata with our taxonomy and broader semantic layer.

Upload the csv you downloaded from my GitHub. You will then be asked how you want to define your model. Select “Generate from schema”.

You’ll see Articles.csv pop up as a node. Click the node. You’ll need to specify which property you want to use as the primary key. There is a property in this list of articles called “id” which we will use as the primary key. To set this as the key, click the key icon in the bottom right for the “id” row. Then select “Run Import”.
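Because “id” becomes the primary key, it helps if every value in that column is unique. Here is a hedged Python sketch of a pre-import check; the sample rows are made up (only the id, title, and topicUri columns are mentioned in this tutorial), and `duplicate_keys` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for Articles.csv; these rows are invented.
SAMPLE_CSV = """id,title,topicUri
1,Example Article A,http://example.org/stem#TopicA
2,Example Article B,http://example.org/stem#TopicB
2,Example Article C,http://example.org/stem#TopicA
"""

def duplicate_keys(csv_text, key="id"):
    """Return key values that appear more than once, in sorted order."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

print(duplicate_keys(SAMPLE_CSV))
```

An empty list means the column is safe to use as a key; for a real file you would read its text from disk instead of using the inline sample.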

You will be prompted to enter the password for this instance, which is the one you wrote down at the beginning. It will take a second to run but then you will get this popup of Import results.

You can see that 15 nodes were created. The CSV file contained 15 articles and each of them became a node. Now we can go back to the Explore feature and search for “Articles.csv”. You’ll see Articles show up in the visual in pink alongside the STEM categories in green. This is great but they are not yet linked. To connect the instance data (articles) to the categories, we need to run a Cypher query.

Step 7: Connect instance data with taxonomy

Click Query in the left toolbar. In the query box enter:

// 1) Match every imported article node that has a topicUri
MATCH (a:`Articles.csv`)
WHERE a.topicUri IS NOT NULL

// 2) Find the corresponding Concept by its uri property
MATCH (c:Concept {uri: a.topicUri})

// 3) Create the TAGGED_WITH relationship (idempotent)
MERGE (a)-[:TAGGED_WITH]->(c)

// 4) Return a sanity check
RETURN count(*) AS totalTaggedRelationships;

It should look like this:

Then press “Run”. You’ll see right under that query something that will say “Created 15 relationships”. That’s a good sign. Now go back to the Explorer. Now search for “Articles.csv – TAGGED_WITH – Resource”. You’ll see that all of those pink nodes are now connected to our green taxonomy!
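If you are newer to Cypher, it may help to see the same linking logic as plain Python. This is only an in-memory analogy under made-up data: the URIs, records, and `tag_articles` helper are hypothetical, and nothing here talks to Neo4j.

```python
# Hypothetical data: concept URIs mapped to labels, and articles with a topicUri.
concepts = {
    "http://example.org/stem#MathematicalSoftware": "Mathematical Software",
    "http://example.org/stem#ComputersAndSociety": "Computers and Society",
}
articles = [
    {"id": "1", "topicUri": "http://example.org/stem#MathematicalSoftware"},
    {"id": "2", "topicUri": "http://example.org/stem#ComputersAndSociety"},
    {"id": "3", "topicUri": None},  # skipped, like the IS NOT NULL filter
    {"id": "4", "topicUri": "http://example.org/stem#Unknown"},  # no matching concept
]

def tag_articles(articles, concepts):
    """Return a set of (article id, concept uri) pairs, one per TAGGED_WITH link."""
    links = set()  # a set keeps repeated runs idempotent, similar to MERGE
    for article in articles:
        uri = article.get("topicUri")
        if uri is not None and uri in concepts:
            links.add((article["id"], uri))
    return links

print(len(tag_articles(articles, concepts)))
```

Articles with a missing or unknown URI simply produce no link, which is why a relationship count lower than the article count is worth investigating.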

Step 8: Build a recommendation engine

We are going to run some very basic similarity queries to demonstrate how you’d use the graph we just built for recommendations. First, let’s look at an article and the category it is tagged with. Enter this Cypher query into the query interface. It will list the categories that the article “Advances in Mathematical Software Studies #7” was tagged with.

MATCH (a:`Articles.csv` {title: 'Advances in Mathematical Software Studies #7'})
MATCH (a)-[:TAGGED_WITH]->(c:Concept)
RETURN a.title AS article, c.prefLabel AS tag, c.uri AS uri
ORDER BY tag;

You should see the following output and the category “Mathematical Software”.

Suppose we want to find articles similar to this page turner because we want to recommend them to potential readers. We can look for other articles that are also tagged with Mathematical Software, but we can also take advantage of the taxonomical structure we have in our graph. Mathematical Software is a child of Computer Science, according to the STEM taxonomy. You can go back to EDG to explore the categories and their children. For our recommendation engine, to find articles similar to our Mathematical Software article, we want to find other articles that are tagged with Mathematical Software, but ALSO articles tagged with other branches of Computer Science.

We can do that with the following Cypher query:

// 0) Seed article by its real label
MATCH (me:`Articles.csv` {title: 'Advances in Mathematical Software Studies #7'})  

// 1) get each tagged topic plus its parent
MATCH (me)-[:TAGGED_WITH]->(child:Concept)-[:BROADER]->(parent:Concept)  

// 2) find any other article tagged with a sibling under that same parent
MATCH (siblingChild:Concept)-[:BROADER]->(parent)
MATCH (rec:`Articles.csv`)-[:TAGGED_WITH]->(siblingChild)
WHERE rec <> me

// 3) compute recommendation score
WITH rec, count(DISTINCT parent) AS score  

// 4) now pull in all the direct tags on each recommended article
OPTIONAL MATCH (rec)-[:TAGGED_WITH]->(t:Concept)  

// 5) return title, score, and full tag list
RETURN 
  rec.title                        AS recommendation,
  score                            AS sharedParentCount,
  collect(DISTINCT t.prefLabel)    AS allTaggedTopics
ORDER BY score DESC, recommendation
LIMIT 5;

You should get the following results:

There are no other articles tagged with Mathematical Software, but there are articles tagged with other branches of computer science. “Advances in Computers and Society Studies” is an article tagged with the category “Computers and Society”. This is recommended because the graph knows that both Computers and Society and Mathematical Software are branches of Computer Science.
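The scoring idea, ranking other articles by how many parent topics they share with the seed article, can be sketched in plain Python. The titles, the Algebra and Mathematics entries, and the `recommend` helper below are hypothetical stand-ins for illustration, not data from the tutorial files.

```python
# Hypothetical child -> parent links and article tags, for illustration only.
BROADER = {
    "Mathematical Software": "Computer Science",
    "Computers and Society": "Computer Science",
    "Algebra": "Mathematics",
}
TAGS = {
    "Article on Mathematical Software": ["Mathematical Software"],
    "Article on Computers and Society": ["Computers and Society"],
    "Article on Algebra": ["Algebra"],
}

def recommend(seed, tags, broader, limit=5):
    """Rank other articles by how many parent topics they share with the seed."""
    seed_parents = {broader[t] for t in tags[seed] if t in broader}
    scored = []
    for title, topics in tags.items():
        if title == seed:
            continue
        shared = {broader[t] for t in topics if t in broader} & seed_parents
        if shared:
            scored.append((title, len(shared)))
    # Highest score first, then alphabetical by title.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]

print(recommend("Article on Mathematical Software", TAGS, BROADER))
```

With single-tag articles the score is always 1, so it mostly acts as a filter; it becomes a real ranking signal once articles carry several tags.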

Step 9: Adjusting our taxonomy

I mentioned earlier that one reason you’d want to separate your taxonomy from your graph database is so you can make changes to your taxonomy and easily see the downstream effects in your apps. Let’s try that.

Suppose we want to recategorize Mathematical Software as a branch of Mathematics rather than a branch of Computer Science. To do this in our taxonomy, we just drag and drop the term in the tree structure in EDG.

Now push the taxonomy back into Neo4j using the same cloud button.

Now when we go back to Neo4j and run the recommendation algorithm again, the results are totally different. This is because our original article was tagged with Mathematical Software, which we’ve now classified as a branch of Mathematics. The other articles that are recommended to us are other articles about math, not computer science.
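To see why a single edit has this effect, here is a small Python sketch. The `broader` map and the `siblings` helper are hypothetical; only the move of Mathematical Software from Computer Science to Mathematics mirrors the step above.

```python
# Hypothetical child -> parent links; the Algebra and Computers and Society
# placements are made up for the example.
broader = {
    "Mathematical Software": "Computer Science",
    "Computers and Society": "Computer Science",
    "Algebra": "Mathematics",
}

def siblings(topic, broader):
    """Return the other topics that share this topic's parent."""
    parent = broader.get(topic)
    return sorted(t for t, p in broader.items() if p == parent and t != topic)

before = siblings("Mathematical Software", broader)

# The single edit made in the taxonomy: move the concept under a different parent.
broader["Mathematical Software"] = "Mathematics"

after = siblings("Mathematical Software", broader)
print(before, after)
```

No article tags were touched; only one parent link changed, and every query that relies on sibling topics picks up the new neighborhood.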

Conclusion

This simple demo shows how a taxonomy can bring structure, flexibility, and intelligence to your data applications. By separating your taxonomy (in EDG) from your instance metadata (in Neo4j), you gain the ability to infer relationships, align systems, and evolve your model over time, without having to retag or rebuild downstream apps. The result is a modular architecture that makes your graph smarter as your understanding of the domain grows.

About the author: Steve Hedden is the Head of Product Management at TopQuadrant, where he leads the strategy for EDG, a platform for knowledge graph and metadata management. His work focuses on bridging enterprise data governance and AI through ontologies, taxonomies, and semantic technologies. Steve writes and speaks regularly about knowledge graphs, and the evolving role of semantics in AI systems.
