Cosine similarity is a commonly used metric for operationalizing tasks such as semantic search and document comparison in natural language processing (NLP). Introductory NLP courses often give only a high-level justification for preferring cosine similarity in such tasks (as opposed to, say, Euclidean distance) without explaining the underlying mathematics, leaving many data scientists with a rather vague understanding of the metric. To address this gap, this article lays out the mathematical intuition behind cosine similarity and shows how that intuition helps us interpret results in practice, with hands-on examples in Python.
Note: All figures and formulas in the following sections have been created by the author of this article.
Mathematical Intuition
The cosine similarity metric is based on the cosine function that readers may recall from high school math. The cosine function exhibits a repeating, wave-like pattern, a full cycle of which is depicted in Figure 1 below for the range 0 ≤ x ≤ 2*pi. The Python code used to produce the figure is also included for reference.
import numpy as np
import matplotlib.pyplot as plt
# Define the x range from 0 to 2*pi
x = np.linspace(0, 2 * np.pi, 500)
y = np.cos(x)
# Create the plot
plt.figure(figsize=(8, 4))
plt.plot(x, y, label='cos(x)', color='blue')
# Add x-axis ticks at 0, pi/2, pi, 3*pi/2, and 2*pi
notch_positions = [0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi]
notch_labels = ['0', 'pi/2', 'pi', '3*pi/2', '2*pi']
plt.xticks(ticks=notch_positions, labels=notch_labels)
# Add custom horizontal gridlines only at y = -1, 0, 1
for y_val in [-1, 0, 1]:
    plt.axhline(y=y_val, color='gray', linestyle='--', linewidth=0.5)
# Add vertical gridlines at specified x-values
for x_val in notch_positions:
    plt.axvline(x=x_val, color='gray', linestyle='--', linewidth=0.5)
# Customize the plot
plt.xlabel("x")
plt.ylabel("cos(x)")
# Final layout and display
plt.tight_layout()
plt.show()
The function argument x denotes an angle in radians (e.g., the angle between two vectors in an embedding space), where pi/2, pi, 3*pi/2, and 2*pi correspond to 90, 180, 270, and 360 degrees, respectively.
To understand why the cosine function can serve as a useful basis for designing a vector similarity metric, notice that the basic cosine function, without any functional transformations as shown in Figure 1, has maxima at x = 2*a*pi, minima at x = (2*b + 1)*pi, and roots at x = (c + 1/2)*pi for some integers a, b, and c. In other words, if x denotes the angle between two vectors, cos(x) returns the largest value when the vectors point in the same direction, the smallest value when the vectors point in opposite directions, and zero when the vectors are orthogonal to each other.
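As a quick sanity check, these properties can be verified directly with Python's built-in math module; the snippet below simply evaluates cos(x) at a maximum, a minimum, and a root.

import math

# cos(x) at a maximum (x = 0), a minimum (x = pi), and a root (x = pi/2)
print(math.cos(0))            # 1.0  -> vectors pointing in the same direction
print(math.cos(math.pi))      # -1.0 -> vectors pointing in opposite directions
print(math.cos(math.pi / 2))  # ~6.1e-17, i.e., zero up to floating-point error -> orthogonal vectors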
This behavior of the cosine function neatly captures the interplay between two key concepts in NLP: semantic overlap (conveying how much meaning is shared between two texts) and semantic polarity (capturing the oppositeness of meaning in texts). For example, the texts “I liked this movie” and “I enjoyed this film” would have high semantic overlap (they express essentially the same meaning despite using different words) and low semantic polarity (they do not express opposite meanings). Now, if the embedding vectors for two words happen to encode both semantic overlap and polarity, then we would expect synonyms to have cosine similarity approaching 1, antonyms to have cosine similarity approaching -1, and unrelated words to have cosine similarity approaching 0.
In practice, we will typically not know the angle x directly. Instead, we must derive the cosine value from the vectors themselves. Given two vectors U and V, each with n elements, the cosine of the angle between them (which is exactly the cosine similarity metric) is computed as the dot product of the vectors divided by the product of the vector magnitudes:
cos(x) = (U · V) / (||U|| * ||V||) = (u_1*v_1 + u_2*v_2 + ... + u_n*v_n) / (sqrt(u_1^2 + ... + u_n^2) * sqrt(v_1^2 + ... + v_n^2))
The above formula for the cosine of the angle between two vectors can be derived from the so-called Cosine Rule, as demonstrated in the segment between minutes 12 and 18 of this video.
A neat proof of the Cosine Rule itself is presented in this video.
The following Python implementation of cosine similarity explicitly operationalizes the formulas presented above, without relying on any black-box, third-party packages:
import math
def cosine_similarity(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute dot product and magnitudes
    dot_product = sum(u * v for u, v in zip(U, V))
    magnitude_U = math.sqrt(sum(u ** 2 for u in U))
    magnitude_V = math.sqrt(sum(v ** 2 for v in V))
    # Zero vector handling to avoid division by zero
    if magnitude_U == 0 or magnitude_V == 0:
        raise ValueError("Cannot compute cosine similarity for zero-magnitude vectors.")
    return dot_product / (magnitude_U * magnitude_V)
Interested readers can refer to this article for a more efficient Python implementation of the cosine distance metric (defined as 1 minus cosine similarity) using the NumPy and SciPy packages.
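As a rough illustration of what such a vectorized implementation might look like (a minimal sketch, with a function name and example vectors chosen here for illustration rather than taken from the linked article), NumPy's dot product and norm functions can replace the explicit Python loops, and SciPy's cosine distance function returns 1 minus cosine similarity:

import numpy as np
from scipy.spatial.distance import cosine as cosine_distance

def cosine_similarity_np(U, V):
    # Vectorized equivalent of the pure-Python implementation above
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    return float(np.dot(U, V) / (np.linalg.norm(U) * np.linalg.norm(V)))

U, V = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
print(cosine_similarity_np(U, V))  # ~0.714
print(1 - cosine_distance(U, V))   # same value via SciPy's cosine distance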
Finally, it is worth comparing the mathematical intuition of cosine similarity (or distance) with that of Euclidean distance, which measures the straight-line distance between two vectors and can also serve as a vector similarity metric. In particular, the lower the Euclidean distance between two vectors, the higher their semantic similarity is likely to be. The Euclidean distance between two vectors U and V (each of length n) can be computed using the following formula:
euclidean_distance(U, V) = sqrt((u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_n - v_n)^2)
Below is the corresponding Python implementation:
import math
def euclidean_distance(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute sum of squared differences
    sum_squared_diff = sum((u - v) ** 2 for u, v in zip(U, V))
    # Take the square root of the sum
    return math.sqrt(sum_squared_diff)
Notice that, since the elementwise differences in the Euclidean distance formula are squared, the resulting metric will always be a non-negative number — zero if the vectors are identical, positive otherwise. In the NLP context, this implies that Euclidean distance will not reflect semantic polarity in quite the same way as cosine distance does. Moreover, as long as two vectors point in the same direction, the cosine of the angle between them will remain the same regardless of the vector magnitudes. By contrast, the Euclidean distance metric is affected by differences in vector magnitude, which may lead to misleading interpretations in practice (e.g., two texts of different lengths may yield a high Euclidean distance despite being semantically similar). As such, cosine similarity is the preferred metric in many NLP scenarios, where determining vector — or semantic — directionality is the primary concern.
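To make this concrete, here is a small check using the cosine_similarity and euclidean_distance functions defined above (the example vectors are arbitrary): scaling a vector changes the Euclidean distance but leaves the cosine untouched.

U = [1.0, 2.0, 3.0]
V = [2.0, 4.0, 6.0]  # same direction as U, but twice the magnitude

# Cosine similarity ignores magnitude: the vectors are perfectly aligned
print(cosine_similarity(U, V))   # 1.0

# Euclidean distance is sensitive to magnitude despite the identical direction
print(euclidean_distance(U, V))  # sqrt(14) ~= 3.74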
Theory versus Practice
In a practical NLP scenario, the interpretation of cosine similarity hinges on the extent to which the vector embeddings encode polarity as well as semantic overlap. In the following hands-on example, we will investigate the similarity between pairs of words using a pretrained embedding model that does not encode polarity (all-MiniLM-L6-v2) and one that does (distilbert-base-uncased-finetuned-sst-2-english). We will also use a more efficient implementation of cosine similarity by leveraging the cosine distance function provided by the SciPy package.
from scipy.spatial.distance import cosine as cosine_distance
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
# Words to embed
words = ["movie", "film", "good", "bad", "spoon", "car"]
# Load two pre-trained models from Hugging Face
model_1 = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model_2_name = "distilbert-base-uncased-finetuned-sst-2-english"
model_2_tokenizer = AutoTokenizer.from_pretrained(model_2_name)
model_2 = AutoModel.from_pretrained(model_2_name)
# Generate embeddings for model 1
embeddings_1 = dict(zip(words, model_1.encode(words)))
# Generate embeddings for model 2
inputs = model_2_tokenizer(words, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model_2(**inputs)
embedding_vectors_model_2 = outputs.last_hidden_state.mean(dim=1)
embeddings_2 = {word: vector for word, vector in zip(words, embedding_vectors_model_2)}
# Compute and print cosine similarity (1 - cosine distance) for both embedding models
print("Cosine similarity for embedding model 1:")
print("movie", "\t", "film", "\t", 1 - cosine_distance(embeddings_1["movie"], embeddings_1["film"]))
print("good", "\t", "bad", "\t", 1 - cosine_distance(embeddings_1["good"], embeddings_1["bad"]))
print("spoon", "\t", "car", "\t", 1 - cosine_distance(embeddings_1["spoon"], embeddings_1["car"]))
print()
print("Cosine similarity for embedding model 2:")
print("movie", "\t", "film", "\t", 1 - cosine_distance(embeddings_2["movie"], embeddings_2["film"]))
print("good", "\t", "bad", "\t", 1 - cosine_distance(embeddings_2["good"], embeddings_2["bad"]))
print("spoon", "\t", "car", "\t", 1 - cosine_distance(embeddings_2["spoon"], embeddings_2["car"]))
print()
Output:
Cosine similarity for embedding model 1:
movie film 0.8426464702276286
good bad 0.5871497042685934
spoon car 0.22919675707817078
Cosine similarity for embedding model 2:
movie film 0.9638281550070811
good bad -0.3416433451550165
spoon car 0.5418748837234599
The words “movie” and “film”, which are typically used as synonyms, have a cosine similarity close to 1 under both models, suggesting high semantic overlap, as expected. The words “good” and “bad” are antonyms, and this is reflected in the negative cosine similarity produced by the second embedding model, which is known to encode semantic polarity. Finally, the words “spoon” and “car” are semantically unrelated, and the near-orthogonality of their embedding vectors is suggested by cosine similarity results that are closer to zero than those for “movie” and “film”.
The Wrap
The cosine similarity between two vectors is based on the cosine of the angle they form, and — unlike metrics such as Euclidean distance — is not sensitive to differences in vector magnitudes. In theory, cosine similarity should be close to 1 if the vectors point in the same direction (indicating high similarity), close to -1 if the vectors point in opposite directions (indicating high dissimilarity), and close to 0 if the vectors are orthogonal (indicating unrelatedness). However, the exact interpretation of cosine similarity in a given NLP scenario depends on the nature of the embedding model used to vectorize the textual data (e.g., whether the embedding model encodes polarity in addition to semantic overlap).