Cosine similarity is a commonly used metric for operationalizing tasks such as semantic search and document comparison in natural language processing (NLP). Introductory NLP courses often give only a high-level justification for preferring cosine similarity in such tasks (as opposed to, say, Euclidean distance) without explaining the underlying mathematics, leaving many data scientists with a rather vague understanding of the metric. To address this gap, this article lays out the mathematical intuition behind cosine similarity and shows how that intuition helps us interpret results in practice, with hands-on examples in Python.
Note: All figures and formulas in the following sections have been created by the author of this article.
Mathematical Intuition
The cosine similarity metric is based on the cosine function that readers may recall from high school math. The cosine function exhibits a repeating, wave-like pattern, a full cycle of which is depicted in Figure 1 below for the range 0 ≤ x ≤ 2*pi. The Python code used to produce the figure is also included for reference.
import numpy as np
import matplotlib.pyplot as plt
# Define the x range from 0 to 2*pi
x = np.linspace(0, 2 * np.pi, 500)
y = np.cos(x)
# Create the plot
plt.figure(figsize=(8, 4))
plt.plot(x, y, label='cos(x)', color='blue')
# Add x-axis ticks at 0, pi/2, pi, 3*pi/2, and 2*pi
notch_positions = [0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi]
notch_labels = ['0', 'pi/2', 'pi', '3*pi/2', '2*pi']
plt.xticks(ticks=notch_positions, labels=notch_labels)
# Add custom horizontal gridlines only at y = -1, 0, 1
for y_val in [-1, 0, 1]:
    plt.axhline(y=y_val, color='gray', linestyle='--', linewidth=0.5)
# Add vertical gridlines at specified x-values
for x_val in notch_positions:
    plt.axvline(x=x_val, color='gray', linestyle='--', linewidth=0.5)
# Customize the plot
plt.xlabel("x")
plt.ylabel("cos(x)")
# Final layout and display
plt.tight_layout()
plt.show()
The function argument x denotes an angle in radians (e.g., the angle between two vectors in an embedding space), where pi/2, pi, 3*pi/2, and 2*pi correspond to 90, 180, 270, and 360 degrees, respectively.
To understand why the cosine function can serve as a useful basis for designing a vector similarity metric, notice that the basic cosine function, without any functional transformations as shown in Figure 1, has maxima at x = 2*a*pi, minima at x = (2*b + 1)*pi, and roots at x = (c + 1/2)*pi for some integers a, b, and c. In other words, if x denotes the angle between two vectors, cos(x) returns the largest value when the vectors point in the same direction, the smallest value when the vectors point in opposite directions, and zero when the vectors are orthogonal to each other.
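As a quick sanity check, these properties can be verified directly with Python's built-in math module; the snippet below simply evaluates cos(x) at a maximum, a minimum, and a root.

import math

# cos(x) at a maximum (x = 0), a minimum (x = pi), and a root (x = pi/2)
print(math.cos(0))            # 1.0  -> vectors pointing in the same direction
print(math.cos(math.pi))      # -1.0 -> vectors pointing in opposite directions
print(math.cos(math.pi / 2))  # ~6.1e-17, i.e., zero up to floating-point error -> orthogonal vectors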
This behavior of the cosine function neatly captures the interplay between two key concepts in NLP: semantic overlap (conveying how much meaning is shared between two texts) and semantic polarity (capturing the oppositeness of meaning in texts). For example, the texts “I liked this movie” and “I enjoyed this film” would have high semantic overlap (they express essentially the same meaning despite using different words) and low semantic polarity (they do not express opposite meanings). Now, if the embedding vectors for two words happen to encode both semantic overlap and polarity, then we would expect synonyms to have cosine similarity approaching 1, antonyms to have cosine similarity approaching -1, and unrelated words to have cosine similarity approaching 0.
In practice, we will typically not know the angle x directly. Instead, we must derive the cosine value from the vectors themselves. Given two vectors U and V, each with n elements, the cosine of the angle between them (which is exactly the cosine similarity metric) is computed as the dot product of the vectors divided by the product of the vector magnitudes:
cos(x) = (U · V) / (||U|| * ||V||) = (u_1*v_1 + u_2*v_2 + ... + u_n*v_n) / (sqrt(u_1^2 + ... + u_n^2) * sqrt(v_1^2 + ... + v_n^2))
The above formula for the cosine of the angle between two vectors can be derived from the so-called Cosine Rule, as demonstrated in the segment between minutes 12 and 18 of this video.
A neat proof of the Cosine Rule itself is presented in this video.
The following Python implementation of cosine similarity explicitly operationalizes the formulas presented above, without relying on any black-box, third-party packages:
import math
def cosine_similarity(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute dot product and magnitudes
    dot_product = sum(u * v for u, v in zip(U, V))
    magnitude_U = math.sqrt(sum(u ** 2 for u in U))
    magnitude_V = math.sqrt(sum(v ** 2 for v in V))
    # Zero vector handling to avoid division by zero
    if magnitude_U == 0 or magnitude_V == 0:
        raise ValueError("Cannot compute cosine similarity for zero-magnitude vectors.")
    return dot_product / (magnitude_U * magnitude_V)
Interested readers can refer to this article for a more efficient Python implementation of the cosine distance metric (defined as 1 minus cosine similarity) using the NumPy and SciPy packages.
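As a rough illustration of what such a vectorized implementation might look like (a minimal sketch, with a function name and example vectors chosen here for illustration rather than taken from the linked article), NumPy's dot product and norm functions can replace the explicit Python loops, and SciPy's cosine distance function returns 1 minus cosine similarity:

import numpy as np
from scipy.spatial.distance import cosine as cosine_distance

def cosine_similarity_np(U, V):
    # Vectorized equivalent of the pure-Python implementation above
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    return float(np.dot(U, V) / (np.linalg.norm(U) * np.linalg.norm(V)))

U, V = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
print(cosine_similarity_np(U, V))  # ~0.714
print(1 - cosine_distance(U, V))   # same value via SciPy's cosine distance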
Finally, it is worth comparing the mathematical intuition of cosine similarity (or distance) with that of Euclidean distance, which measures the straight-line distance between two vectors and can also serve as a vector similarity metric. In particular, the lower the Euclidean distance between two vectors, the higher their semantic similarity is likely to be. The Euclidean distance between two vectors U and V (each of length n) can be computed using the following formula:
euclidean_distance(U, V) = sqrt((u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_n - v_n)^2)
Below is the corresponding Python implementation:
import math
def euclidean_distance(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute sum of squared differences
    sum_squared_diff = sum((u - v) ** 2 for u, v in zip(U, V))
    # Take the square root of the sum
    return math.sqrt(sum_squared_diff)
Notice that, since the elementwise differences in the Euclidean distance formula are squared, the resulting metric will always be a non-negative number — zero if the vectors are identical, positive otherwise. In the NLP context, this implies that Euclidean distance will not reflect semantic polarity in quite the same way as cosine distance does. Moreover, as long as two vectors point in the same direction, the cosine of the angle between them will remain the same regardless of the vector magnitudes. By contrast, the Euclidean distance metric is affected by differences in vector magnitude, which may lead to misleading interpretations in practice (e.g., two texts of different lengths may yield a high Euclidean distance despite being semantically similar). As such, cosine similarity is the preferred metric in many NLP scenarios, where determining vector — or semantic — directionality is the primary concern.
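To make this concrete, here is a small check using the cosine_similarity and euclidean_distance functions defined above (the example vectors are arbitrary): scaling a vector changes the Euclidean distance but leaves the cosine untouched.

U = [1.0, 2.0, 3.0]
V = [2.0, 4.0, 6.0]  # same direction as U, but twice the magnitude

# Cosine similarity ignores magnitude: the vectors are perfectly aligned
print(cosine_similarity(U, V))   # 1.0

# Euclidean distance is sensitive to magnitude despite the identical direction
print(euclidean_distance(U, V))  # sqrt(14) ~= 3.74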
Theory versus Practice
In a practical NLP scenario, the interpretation of cosine similarity hinges on the extent to which the vector embeddings encode polarity as well as semantic overlap. In the following hands-on example, we will investigate the similarity between pairs of words using a pretrained embedding model that does not encode polarity (all-MiniLM-L6-v2) and one that does (distilbert-base-uncased-finetuned-sst-2-english). We will also use a more efficient implementation of cosine similarity by leveraging the cosine distance function provided by the SciPy package.
from scipy.spatial.distance import cosine as cosine_distance
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
# Words to embed
words = ["movie", "film", "good", "bad", "spoon", "car"]
# Load two pre-trained models from Hugging Face
model_1 = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model_2_name = "distilbert-base-uncased-finetuned-sst-2-english"
model_2_tokenizer = AutoTokenizer.from_pretrained(model_2_name)
model_2 = AutoModel.from_pretrained(model_2_name)
# Generate embeddings for model 1
embeddings_1 = dict(zip(words, model_1.encode(words)))
# Generate embeddings for model 2
inputs = model_2_tokenizer(words, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model_2(**inputs)
embedding_vectors_model_2 = outputs.last_hidden_state.mean(dim=1)
embeddings_2 = {word: vector for word, vector in zip(words, embedding_vectors_model_2)}
# Compute and print cosine similarity (1 - cosine distance) for both embedding models
print("Cosine similarity for embedding model 1:")
print("movie", "\t", "film", "\t", 1 - cosine_distance(embeddings_1["movie"], embeddings_1["film"]))
print("good", "\t", "bad", "\t", 1 - cosine_distance(embeddings_1["good"], embeddings_1["bad"]))
print("spoon", "\t", "car", "\t", 1 - cosine_distance(embeddings_1["spoon"], embeddings_1["car"]))
print()
print("Cosine similarity for embedding model 2:")
print("movie", "\t", "film", "\t", 1 - cosine_distance(embeddings_2["movie"], embeddings_2["film"]))
print("good", "\t", "bad", "\t", 1 - cosine_distance(embeddings_2["good"], embeddings_2["bad"]))
print("spoon", "\t", "car", "\t", 1 - cosine_distance(embeddings_2["spoon"], embeddings_2["car"]))
print()
Output:
Cosine similarity for embedding model 1:
movie film 0.8426464702276286
good bad 0.5871497042685934
spoon car 0.22919675707817078
Cosine similarity for embedding model 2:
movie film 0.9638281550070811
good bad -0.3416433451550165
spoon car 0.5418748837234599
The words “movie” and “film”, which are typically used as synonyms, have a cosine similarity close to 1 under both models, suggesting high semantic overlap, as expected. The words “good” and “bad” are antonyms, and this is reflected in the negative cosine similarity produced by the second embedding model, which is known to encode semantic polarity. Finally, the words “spoon” and “car” are semantically unrelated, and the near-orthogonality of their embedding vectors is suggested by cosine similarity results that are closer to zero than those for “movie” and “film”.
The Wrap
The cosine similarity between two vectors is based on the cosine of the angle they form, and — unlike metrics such as Euclidean distance — is not sensitive to differences in vector magnitudes. In theory, cosine similarity should be close to 1 if the vectors point in the same direction (indicating high similarity), close to -1 if the vectors point in opposite directions (indicating high dissimilarity), and close to 0 if the vectors are orthogonal (indicating unrelatedness). However, the exact interpretation of cosine similarity in a given NLP scenario depends on the nature of the embedding model used to vectorize the textual data (e.g., whether the embedding model encodes polarity in addition to semantic overlap).