
Sparse Autoencoder: From Superposition to Interpretable Features
by Shuyang Xiang | Feb 2025


Disentangling features in complex neural networks with superposition

Complex neural networks, such as Large Language Models (LLMs), often suffer from interpretability challenges. One of the main reasons for this difficulty is superposition: a phenomenon in which the neural network has fewer dimensions than the number of features it must represent. For example, a toy LLM with 2 neurons might have to represent 6 different language features. As a result, we often observe that a single neuron activates for multiple features. For a more detailed explanation and definition of superposition, please refer to my previous blog post: “Superposition: What Makes it Difficult to Explain Neural Network”.

In this blog post, we take one step further: let’s try to disentangle some superposed features. I will introduce a methodology called the Sparse Autoencoder, which decomposes a complex neural network, in particular an LLM, into interpretable features, illustrated with a toy example of language features.

A Sparse Autoencoder is, by definition, an autoencoder in which sparsity is deliberately introduced in the activations of its hidden layer. With a rather simple structure and a light training process, it aims to decompose a complex neural network and uncover its features in a way that is more interpretable and more understandable to humans.

Let us imagine that you have a trained neural network. The autoencoder is not part of the training process of the model itself but is instead a post-hoc analysis tool. The original model has its own activations, and these activations are collected afterwards and then used as input data for the sparse autoencoder.

For example, suppose that your original model is a neural network with one hidden layer of 5 neurons and that you have a training dataset of 5000 samples. You collect the 5-dimensional activations of that hidden layer for all 5000 training samples, and these activations become the input of your sparse autoencoder.
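In practice, these activations can be collected with a forward hook on the hidden layer. The sketch below is a minimal illustration of this idea for the 5-neuron, 5000-sample setup above; the stand-in model, its input size of 10, and the random inputs are placeholders for illustration, not the toy model used later in this post.

import torch
import torch.nn as nn

# A stand-in for the trained model we want to analyse:
# one hidden layer with 5 neurons (purely illustrative).
original_model = nn.Sequential(
    nn.Linear(10, 5),   # hidden layer whose activations we collect
    nn.ReLU(),
    nn.Linear(5, 2),
)

collected = []

def save_activation(module, inputs, output):
    # Store a detached copy of the hidden activations for later analysis.
    collected.append(output.detach())

# Register the hook on the ReLU that follows the 5-neuron hidden layer.
hook = original_model[1].register_forward_hook(save_activation)

# Run the 5000 training samples through the model (inference only).
inputs = torch.randn(5000, 10)      # placeholder for the real training data
with torch.no_grad():
    original_model(inputs)
hook.remove()

activations = torch.cat(collected)  # shape: (5000, 5), the input of the sparse autoencoder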

Image by author: Autoencoder to analyse an LLM

The autoencoder then learns a new, sparse representation from these activations. The encoder maps the original MLP activations into a vector space of higher dimension. Looking back at my simple 5-neuron example, we might map them into a space of 20 features. Ideally, the sparse autoencoder decomposes the original MLP activations into a representation that is easier to interpret and analyze.

Sparsity is important in the autoencoder because it is what allows the autoencoder to “disentangle” features, with more “freedom” than in a dense, overlapping space. Without sparsity, the autoencoder would probably just learn a trivial compression without forming any meaningful features.

Language model

Let us now build our toy model. Please note that this model is not realistic, and even a bit silly in practice, but it is sufficient to showcase how we build a sparse autoencoder and capture some features.

Suppose now that we have built a language model with one particular hidden layer whose activation has four dimensions. Let us also suppose that the training dataset contains the following tokens: “cat,” “happy cat,” “dog,” “energetic dog,” “not cat,” “not dog,” “robot,” and “AI assistant,” and that they have the following activation values.

import torch

data = torch.tensor([
    # Cat categories
    [0.8, 0.3, 0.1, 0.05],    # "cat"
    [0.82, 0.32, 0.12, 0.06], # "happy cat" (similar to "cat")

    # Dog categories
    [0.7, 0.2, 0.05, 0.2],    # "dog"
    [0.75, 0.3, 0.1, 0.25],   # "energetic dog" (similar to "dog")

    # "Not animal" categories
    [0.05, 0.9, 0.4, 0.4],    # "not cat"
    [0.15, 0.85, 0.35, 0.5],  # "not dog"

    # Robot and AI assistant (more distinct in 4D space)
    [0.0, 0.7, 0.9, 0.8],     # "robot"
    [0.1, 0.6, 0.85, 0.75]    # "AI assistant"
], dtype=torch.float32)

Construction of the autoencoder

We now build the autoencoder with the following code:

import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SparseAutoencoder, self).__init__()
        # Encoder: a single linear layer into a higher-dimensional space, followed by ReLU
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        # Decoder: a single linear layer that reconstructs the original activations
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

According to the code above, the encoder has only one fully connected linear layer, which maps the input to a hidden representation of size hidden_dim and then passes it through a ReLU activation. The decoder uses just one linear layer to reconstruct the input. Note that the absence of a ReLU activation in the decoder is intentional for our reconstruction case, because the reconstruction may contain real-valued and potentially negative data. A ReLU would force the output to stay non-negative, which is not desirable here.
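As a quick sanity check, here is how the autoencoder can be instantiated for our 4-dimensional toy activations. The hidden dimension of 8 is an illustrative assumption; the notebook may use a different value.

model = SparseAutoencoder(input_dim=4, hidden_dim=8)  # hidden_dim=8 is an arbitrary illustrative choice

encoded, decoded = model(data)
print(encoded.shape)   # torch.Size([8, 8]): 8 tokens, 8 candidate features
print(decoded.shape)   # torch.Size([8, 4]): reconstruction of the original activations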

We train the model using the code below. Here, the loss function has two parts: the reconstruction loss, which measures how accurately the autoencoder reconstructs the input data, and a sparsity loss (with a weight), which encourages sparse activations in the encoder.

# Training setup (the values below are example choices; the notebook may use different ones)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 500
sparsity_weight = 0.1

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()

    # Forward pass
    encoded, decoded = model(data)

    # Reconstruction loss
    reconstruction_loss = criterion(decoded, data)

    # Sparsity penalty (L1 regularization on the encoded features)
    sparsity_loss = torch.mean(torch.abs(encoded))

    # Total loss
    loss = reconstruction_loss + sparsity_weight * sparsity_loss

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

Now we can have a look at the results. We have plotted the encoder’s output for each of the original model’s activations. Recall that the input tokens are “cat,” “happy cat,” “dog,” “energetic dog,” “not cat,” “not dog,” “robot,” and “AI assistant”.

Image by author: features learned by encoder
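A plot like this can be produced by running the trained encoder on the data once more and drawing the activations as a heatmap. The sketch below is one possible way to do it; matplotlib is assumed, and the token labels simply mirror our toy dataset.

import matplotlib.pyplot as plt

tokens = ["cat", "happy cat", "dog", "energetic dog",
          "not cat", "not dog", "robot", "AI assistant"]

with torch.no_grad():
    encoded, _ = model(data)          # sparse features learned by the encoder

plt.figure(figsize=(8, 4))
plt.imshow(encoded.numpy(), aspect="auto", cmap="viridis")
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("Learned feature")
plt.colorbar(label="Activation")
plt.title("Encoder activations per token")
plt.show()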

Even though the original model was designed with a very simple architecture and without any deep consideration, the autoencoder has still captured meaningful features of this trivial model. According to the plot above, we can observe at least four features that appear to have been learned by the encoder.

Let us first consider Feature 1. This feature has large activation values for the following 4 tokens: “cat”, “happy cat”, “dog”, and “energetic dog”. The result suggests that Feature 1 corresponds to something like “animals” or “pets”. Feature 2 is also an interesting example, activating on the two tokens “robot” and “AI assistant”. We therefore guess that this feature has something to do with artificial intelligence and robotics, indicating the model’s understanding of technological contexts. Feature 3 activates on 4 tokens: “not cat”, “not dog”, “robot” and “AI assistant”, and is possibly a feature meaning “not an animal”.
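If we want to read off these groupings programmatically rather than from the plot, one simple option is to list, for each learned feature, the tokens whose activation exceeds a small threshold. The snippet below is a sketch of that idea; the 0.1 threshold is an arbitrary illustrative choice, and tokens refers to the label list defined above.

threshold = 0.1   # arbitrary cut-off for "this feature fires on this token"

with torch.no_grad():
    encoded, _ = model(data)

for feature_idx in range(encoded.shape[1]):
    active_tokens = [tokens[i] for i in range(len(tokens))
                     if encoded[i, feature_idx] > threshold]
    if active_tokens:
        print(f"Feature {feature_idx}: {active_tokens}")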

Unfortunately, the original model is not a real model trained on real-world text, but one designed artificially under the assumption that similar tokens have similar activations in the vector space. However, the results still provide interesting insights: the sparse autoencoder succeeded in surfacing meaningful, human-friendly features that correspond to real-world concepts.

The simple result in this blog post suggests that a sparse autoencoder can effectively help extract high-level, interpretable features from complex neural networks such as LLMs.

For readers interested in a real-world implementation of sparse autoencoders, I recommend this article, in which an autoencoder was trained to interpret a real large language model with 512 neurons. That study provides a real application of sparse autoencoders in the context of LLM interpretability.

Finally, here is the Google Colab notebook with the detailed implementation mentioned in this article.

