This article continues our exploration of topic modeling open-source intelligence (OSINT) from the OpenAlex API. In a previous article, I gave an introduction to topic modeling, the data used, and a traditional NLP approach using Latent Dirichlet Allocation (LDA).
See the previous article here:
This article takes a more advanced approach to topic modeling by leveraging representation models, generative AI, and other advanced techniques. We use BERTopic to bring several models together into one pipeline, visualize our topics, and explore variations of topic models.
The BERTopic Pipeline
Using a traditional approach to topic modeling can be difficult, as it requires building your own pipeline to clean the data, tokenize, lemmatize, create features, and so on. Traditional models like LDA or LSA can also be computationally expensive and often yield poor results.
BERTopic leverages the transformer architecture through embedding models, and incorporates other components like dimensionality reduction and topic representation models, to create high-performing topic models. BERTopic also provides variations of models to fit a variety of data and use cases, visualizations to explore results, and more.
The biggest advantage of BERTopic is its modularity. As seen above, the pipeline is composed of several different models:
- Embedding model
- Dimensionality Reduction model
- Clustering model
- Tokenizer
- Weighting Scheme
- Representation model (optional)
Therefore, we can experiment with different models in each component, each with its own parameters. For example, we can try different embedding models, switch the dimensionality reduction model from PCA to UMAP, or fine-tune the parameters of our clustering model. This is a huge advantage that allows us to fit a topic model to our data and use case.
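As a quick illustration of this modularity, here is a minimal sketch of what swapping in different components might look like; the specific models and parameter values below are illustrative choices, not recommendations.
#illustrative sketch: swapping pipeline components (models and parameters are example choices)
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from bertopic import BERTopic
#any model exposing .fit()/.transform() can stand in for UMAP, and any clustering
#model exposing .fit()/.predict() can stand in for HDBSCAN
alternative_model = BERTopic(
    umap_model=PCA(n_components=5),
    hdbscan_model=KMeans(n_clusters=25)
)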
First, we need to import the necessary modules. Most of these are used to build the components of our BERTopic model.
#import packages for data management
import pickle
#import packages for topic modeling
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap.umap_ import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
#import packages for data manipulation and visualization
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as sch
Embedding Model
The main component of the BERTopic pipeline is the embedding model. First, we initialize it with the SentenceTransformer class, passing the name of the embedding model we would like to use.
In this case, I am using a relatively small model (~30 million parameters). While we can probably get better results using larger embedding models, I decided to use a smaller model to emphasize speed in this pipeline. You can find and compare embedding models based on their size, performance, intended use, etc. by using the MTEB leaderboard from Hugging Face (https://huggingface.co/spaces/mteb/leaderboard).
#initialize embedding model
embedding_model = SentenceTransformer('thenlper/gte-small')
#calculate embeddings
embeddings = embedding_model.encode(data['all_text'].tolist(), show_progress_bar=True)
Once we run our model, we can use the .shape attribute to see the size of the vectors produced. Below, we can see that each embedding contains 384 values, which together encode the meaning of each document.
#investigate shape and size of vectors
embeddings.shape
#output: (6102, 384)
Dimensionality Reduction Model
The next component of the BERTopic model is the dimensionality reduction model. Since high-dimensional data can be troublesome to model, a dimensionality reduction model lets us represent the embeddings in a lower-dimensional space without losing too much information.
There are several types of dimensionality reduction models, with Principal Component Analysis (PCA) being the most popular. In this case, we will use a Uniform Manifold Approximation and Projection (UMAP) model. UMAP is a non-linear model and is likely to handle the complex relationships in our data better than PCA.
#initialize dimensionality reduction model and reduce embeddings
umap_model = UMAP(n_neighbors=5, min_dist=0.0, metric='cosine', random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)
It is important to note that dimensionality reduction is not a cure-all for high-dimensional data. It presents a tradeoff between speed and accuracy, since some information is inevitably lost. These models need to be chosen carefully and experimented with to avoid losing too much information while maintaining speed and scalability.
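As a rough sketch of that kind of experimentation, one could compare a few target dimensionalities and check how well local structure is preserved; the candidate values and the use of scikit-learn's trustworthiness score here are just one possible approach.
#sketch: compare a few target dimensionalities (candidate values are arbitrary)
from sklearn.manifold import trustworthiness
for n_components in [2, 5, 10]:
    candidate = UMAP(n_neighbors=5, n_components=n_components, min_dist=0.0, metric='cosine', random_state=42)
    candidate_embeddings = candidate.fit_transform(embeddings)
    #trustworthiness measures how well local neighborhoods are preserved (closer to 1 is better)
    print(n_components, trustworthiness(embeddings, candidate_embeddings, n_neighbors=5))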
Clustering Model
The third step is to use the reduced embeddings to create clusters. While clustering is not strictly required for topic modeling, density-based clustering models let us isolate outliers and eliminate noise in our data. Below, we initialize the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) model and create our clusters.
#initialize clustering model and cluster
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom').fit(reduced_embeddings)
clusters = hdbscan_model.labels_
A density-based approach gives us a few advantages. Documents are not forced into clusters they do not belong to, which isolates outliers and reduces noise in our data. Also, unlike centroid-based models, we do not have to specify the number of clusters, and the clusters are more likely to be well-defined.
See my guide to clustering algorithms:
See the code below to visualize the results of the clustering model.
#create dataframe of reduced embeddings and clusters
df = pd.DataFrame(reduced_embeddings, columns = ['x', 'y'])
df['Cluster'] = [str(c) for c in clusters]
#split between clusters and outliers
to_plot = df.loc[df.Cluster != '-1', :]
outliers = df.loc[df.Cluster == '-1', :]
#plot clusters
plt.scatter(outliers.x, outliers.y, alpha = 0.05, s = 2, c = 'grey')
plt.scatter(to_plot.x, to_plot.y, alpha = 0.6, s = 2, c = to_plot.Cluster.astype(int), cmap = 'tab20b')
plt.axis('off')
We can see well-defined clusters that do not overlap. We can also see some smaller clusters group together to make up higher-level topics. Lastly, we can see several documents are greyed out and identified as outliers.
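To put numbers to what the plot shows, we can quickly count how many clusters HDBSCAN found and how many documents it flagged as outliers (label -1):
#count clusters and outliers (HDBSCAN labels outliers as -1)
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_outliers = (clusters == -1).sum()
print(f"{n_clusters} clusters, {n_outliers} outliers out of {len(clusters)} documents")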
Creating a BERTopic Pipeline
We now have the necessary components to build our BERTopic pipeline (an embedding model, a dimensionality reduction model, and a clustering model). We can take the models we have initialized and fit them to our data using the BERTopic class.
#use models above to build BERTopic pipeline
topic_model = BERTopic(
embedding_model=embedding_model, # Step 1 - Extract embeddings
umap_model=umap_model, # Step 2 - Reduce dimensionality
hdbscan_model=hdbscan_model, # Step 3 - Cluster reduced embeddings
verbose = True).fit(data['all_text'].tolist(), embeddings)
Since I know I ingested papers about human-machine interfaces (augmented reality, virtual reality), let’s see which topics align with the term “augmented reality”.
#topics most similar to 'augmented reality'
topic_model.find_topics("augmented reality")
#output: ([18, 3, 16, 24, 12], [0.9532771, 0.9498462, 0.94966936, 0.9451431, 0.9417263])
From the output above, we can see that topics 18, 3, 16, 24, and 12 align most closely with the term “augmented reality”. All of these topics should (hopefully) contribute to the broader theme of augmented reality, but each covers a different aspect.
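One way to sanity-check this is to pull the summary rows for those topics from the model's topic overview:
#inspect the topics most similar to 'augmented reality'
topic_info = topic_model.get_topic_info()
topic_info[topic_info.Topic.isin([18, 3, 16, 24, 12])]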
To confirm this, let’s investigate the topic representations. A topic representation is a list of terms that aims to properly represent the underlying theme of the topic. For example, the terms “cake”, “candles”, “family”, and “presents” may collectively represent the topic of birthdays or birthday parties.
We can use the get_topic() function to investigate the representation of topic 18.
#investigate topic 18
topic_model.get_topic(18)
In the above representation, we see some useful terms like “reality”, “virtual”, “augmented”, etc. However, the representation is not useful as a whole, as it contains several stop words like “and” and “the”. This is because BERTopic uses a Bag-of-Words representation of topics by default. A representation like this is also hard to distinguish from those of the other augmented reality topics.
Next, we will improve our BERTopic pipeline to create more meaningful topic representations that give us more insight into these themes.
Improving Topic Representations
We can improve the topic representations by adding a weighting scheme, which will highlight the most important terms and better differentiate our topics.
This does not replace the Bag-of-Words model, but improves upon it. Below, we add a class-based TF-IDF (c-TF-IDF) model to better determine the importance of each term. We use the update_topics() function to update our pipeline.
#initialize tokenizer model
vectorizer_model = CountVectorizer(stop_words="english")
#initialize ctfidf model to weight terms
ctfidf_model = ClassTfidfTransformer()
#add tokenizer and ctfidf to pipeline
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model)
#investigate how topic representations have changed
topic_model.get_topic(18)
With c-TF-IDF, these topic representations are much more useful. We can see that the meaningless stop words are gone, other terms that help describe the topic appear, and the terms are reordered by their importance.
But we do not have to stop here. Thanks to countless new developments in the world of AI and NLP, there are methods we can leverage to fine-tune these representations.
To fine-tune, we can take one of two approaches:
- A representation model
- A generative model
Fine-Tuning with a Representation Model
First, let’s add the KeyBERTInspired model as our representation model. This leverages BERT embeddings to compare the semantic similarity between the c-TF-IDF representations and the documents themselves, to better determine the relevance of each term rather than just its importance.
See all representation model options here: https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired
#initialize representation model and add to pipeline
representation_model = KeyBERTInspired()
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, representation_model=representation_model)
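As before, we can pull up topic 18 to see how its representation has changed:
#investigate topic 18 after adding the representation model
topic_model.get_topic(18)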
Here, we see a fairly major change in the terms, with some additional terms and acronyms. Compared to the c-TF-IDF representation, we again get a better understanding of what this topic is about. Also notice that the scores have changed from c-TF-IDF weights, which have little meaning without context, to values between 0 and 1 that represent semantic similarity.
Topic Model Visualizations
Before we move to generative models for fine-tuning, let’s explore some of the visualizations that BERTopic offers. Visualizing topic models is crucial in understanding your data and how the model is working.
First, we can visualize our topics in a 2-dimensional space, allowing us to see the size of each topic and which other topics are similar. Below, we can see we have many topics, with clusters of topics making up larger themes. We can also see one topic that is large and isolated, indicating that there is a lot of similar research regarding CRISPR.
Let’s zoom into these clusters of topics to see how they break down higher-level themes. Below, we zoom into topics regarding augmented and virtual reality and see how some topics cover different domains and applications.
We can also quickly visualize the most important or most relevant terms in each topic. Again, this is dependent on your approach to the topic representations.
We can also use a heatmap to explore the similarity between topics.
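For reference, a brief sketch of the BERTopic calls behind figures like these (each returns an interactive Plotly figure):
#2D map of topics and their relative sizes
topic_model.visualize_topics()
#bar charts of the top terms per topic
topic_model.visualize_barchart()
#heatmap of similarity between topics
topic_model.visualize_heatmap()
#documents plotted in two dimensions, colored by topic
topic_model.visualize_documents(data['all_text'].tolist(), reduced_embeddings=reduced_embeddings)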
These are just a few of the visualizations that BERTopic offers. See the full list here: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html
Leveraging Generative Models
For our last step of fine-tuning our topic representations, we can leverage generative AI to produce representations that are coherent descriptions of the topic.
BERTopic offers an easy way to leverage OpenAI’s GPT models to interact with the topic model. We first establish a prompt that shows the model the data and the current representation of the topics. We then ask it to generate a short label for each topic.
We then initialize the client and model, and update our pipeline.
import openai
from bertopic.representation import OpenAI
#prompt for GPT to create topic labels
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following key words: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic:
"""
#import GPT
client = openai.OpenAI(api_key='API KEY')
#add GPT as representation model
representation_model = OpenAI(client, model = 'gpt-3.5-turbo', exponential_backoff=True, chat=True, prompt=prompt)
topic_model.update_topics(data['all_text'].tolist(), representation_model=representation_model)
Now, let’s go back to the augmented reality topic.
#investigate how topic representations have changed
topic_model.get_topic(18)
#output: [('Comparative analysis of virtual and augmented reality for immersive analytics',1)]
The topic representation now reads “Comparative analysis of virtual and augmented reality for immersive analytics”. The topic is now much clearer, as we can see the objectives, technologies, and domain covered by these documents.
Below is the full list of our new topic representations.
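One way to list them programmatically is through the topic overview, whose Name column reflects the updated representations:
#list all topics with their updated labels
topic_model.get_topic_info()[['Topic', 'Count', 'Name']]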
It does not take much code to see how powerful generative AI is in supporting our topic model and its representations. It is of course extremely important to dig deeper and validate these outputs as you build your model and to do plenty of experimentation with different models, parameters, and approaches.
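A simple starting point for that validation is to read a few representative documents per topic and check that the generated label actually fits:
#read representative documents for topic 18 to validate its label
topic_model.get_representative_docs(topic=18)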
Leveraging Topic Model Variations
Lastly, BERTopic provides several variations of topic models to provide solutions for different data and use cases. These include time-series, hierarchical, supervised, semi-supervised, and many more.
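As a hedged sketch of one of these variations, topics over time could look roughly like this, assuming the dataset has a publication year column (the 'publication_year' column name is an assumption here):
#sketch: topics over time, assuming a 'publication_year' column exists in the data
topics_over_time = topic_model.topics_over_time(
    data['all_text'].tolist(),
    data['publication_year'].tolist()  #hypothetical timestamp column
)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)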
See the full list and documentation here: https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html
Let’s quickly explore one of these possibilities with hierarchical topic modeling. Below, we create a linkage function using scipy, which establishes distances between our topics. We can easily fit it to our data and visualize the hierarchy of topics.
#create linkages between topics
linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(data['all_text'], linkage_function=linkage_function)
#visualize topic model hierarchy
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
In the visualization above, we can see how topics merge to form broader and broader themes. For example, we see topics 25 and 30 come together to form “Smart Cities and Sustainable Development”. This gives us the ability to zoom in and out and decide how broad or narrow we would like our topics to be.
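If you prefer a text view of the same structure, the hierarchy can also be printed as a tree:
#print the topic hierarchy as a text tree
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)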
Conclusion
In this article, we saw the power of BERTopic for topic modeling. BERTopic's use of transformers and embedding models dramatically improves results over traditional approaches. The BERTopic pipeline also offers both power and modularity, chaining several models together and allowing you to plug in other models to fit your data. All of these models can be fine-tuned and combined to create a powerful topic model.
You can also integrate representation and generative models to refine topic representations and improve interpretability. BERTopic also offers several visualizations to truly explore your data and validate your model. Lastly, BERTopic offers several variations of topic modeling, like time-series or hierarchical topic modeling, to better fit your use case.
I hope you have enjoyed my article! Please feel free to comment, ask questions, or request other topics.
Connect with me on LinkedIn: https://www.linkedin.com/in/alexdavis2020/