A recent NeurIPS paper from Konrad Körding’s Lab [1], “Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?”, gives insights into a foundational question in visual neuroscience: what is required to bind visual elements and textures together as objects? The goal of this article is to give you background on this problem, review the paper, and hopefully give you insight into both artificial and biological neural networks. I will also review some self-supervised deep learning methods and vision transformers, while highlighting the differences between current deep learning systems and our brains.
1. Introduction
When we view a scene, our visual system does not just hand our consciousness a high-level summary of the objects and composition; we also have conscious access to an entire visual hierarchy.
We can “grab” an object with our attention in the higher-level areas, like the Inferior Temporal (IT) cortex and Fusiform Face Area (FFA), and access all the contours and textures that are coded in the lower-level areas like V1 and V2.
If we lacked this capability to access our entire visual hierarchy, we would either not have conscious access to the low-level details of the visual scene, or the dimensionality in the higher-level areas would explode as they tried to convey all of this information. This would require our brains to be substantially larger and consume more energy.
This distribution of information of the visual scene across the visual system means that the components or objects of the scene need to be bound together in some manner. For years, there have been two main factions on how this is done: one faction argued that object binding used neural oscillations (or more generally, synchrony) to bind object parts together, and the other faction argued that increases in neural firing were sufficient to bind the attended objects. My academic background puts me firmly in the latter camp, under the tutelage of Rüdiger von der Heydt, Ernst Niebur, and Pieter Roelfsema.
Von der Malsburg and Schneider proposed the neural oscillation binding hypothesis in 1986 (see [2] for a review), in which each object receives its own temporal tag.
In this framework, when you look at a picture with two puppies, all the neurons throughout the visual system encoding the first puppy would fire at one phase of the oscillation, while the neurons encoding the other puppy would fire at a different phase. Evidence for this type of binding was found in anesthetized cats; however, anesthesia increases oscillations in the brain.
In the firing-rate framework, neurons encoding attended objects fire at a higher rate than those encoding unattended objects, and neurons encoding attended or unattended objects fire at a higher rate than those encoding the background. This has been shown repeatedly and robustly in awake animals [3].
Initially, there were more experiments supporting the neural synchrony or oscillation hypotheses, but over time there has been more evidence for the increased firing rate binding hypothesis.
The focus of Li’s paper is whether deep learning models exhibit object binding. They convincingly argue that ViT networks trained by self-supervised learning naturally learn to bind objects, while those trained via supervised classification (ImageNet) do not. The failure of supervised training to teach object binding suggests, in my opinion, a fundamental weakness of relying on a single backpropagated global loss. Without careful tuning of this training paradigm, you get a system that takes shortcuts and (for example) learns textures instead of objects, as shown by Geirhos et al. [4]. The end result is models that are fragile to adversarial attacks and only learn something when it has a significant impact on the final loss function. Fortunately, self-supervised learning works quite well as it stands without my more radical takes, and it is able to reliably learn object binding.
2. Methods
2.1. The Architecture: Vision Transformers (ViT)
I’m going to review the Vision Transformer (ViT; [5]) in this section, so feel free to skip it if you don’t need to brush up on this architecture. Since its introduction, many additional vision transformer architectures have appeared, like the Swin Transformer and various hybrid convolutional transformers, such as CoAtNet and the Convolutional Vision Transformer (CvT). However, the research community keeps coming back to ViT. Part of this is because ViT is well suited to current self-supervised approaches, such as Masked Auto-Encoding (MAE) and I-JEPA (Image Joint Embedding Predictive Architecture).

ViT splits the image into a grid of patches, which are converted into tokens. Tokens in ViT are just feature vectors, whereas tokens in other transformers can be discrete. For Li’s paper, the authors resized the images to \(224\times 224\) pixels and then split them into a grid of \(16\times 16\) patches (\(14\times 14\) pixels per patch). The patches are then converted to tokens by flattening them and applying a learned linear projection.
The positions of the patches in the image are added as positional embeddings using elementwise addition. For classification, the sequence of tokens is prepended with a special, learned classification token. So, if there are \(W \times H\) patches, then there are \(1 + W \times H\) input tokens. There are also \(1 + W \times H\) output tokens from the core ViT model. The first token of the output sequence, which corresponds to the classification token, is passed to the classification head to produce the classification. All of the remaining output tokens are ignored for the classification task. Through training, the network learns to encode the global context of the image needed for classification into this token.
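To make this concrete, here is a minimal sketch (not the paper’s or the original ViT code) of the tokenization step just described: patchify, project, prepend a learned classification token, and add positional embeddings. The patch size and embedding dimension are illustrative values chosen to match the \(16\times 16\) grid of 14-pixel patches mentioned above.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of ViT tokenization: patchify, flatten + project,
    prepend a learned [CLS] token, and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, embed_dim=768):
        super().__init__()
        grid = img_size // patch_size                    # 16 patches per side
        num_patches = grid ** 2                          # W * H patches
        # A strided convolution is the usual way to do "flatten + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + num_patches, embed_dim))

    def forward(self, x):                                # x: (B, 3, 224, 224)
        x = self.proj(x)                                 # (B, D, 16, 16)
        x = x.flatten(2).transpose(1, 2)                 # (B, 256, D): one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                   # (B, 1 + W*H, D)
        return x + self.pos_embed                        # elementwise positional addition

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                      # torch.Size([2, 257, 768])
```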
The tokens are passed through the transformer encoder, which keeps the length of the sequence the same. There is an implied correspondence between each input token and the token at the same position throughout the network. While there is no guarantee of what the tokens in the middle of the network will encode, this can be influenced by the training method. A dense task, like MAE, enforces this correspondence between the \(i\)-th token of the input sequence and the \(i\)-th token of the output sequence. A task with a coarse signal, like classification, might not teach the network to keep this correspondence.
2.2. The Training Regimes: Self-Supervised Learning (SSL)
You don’t necessarily need to know the details of the self-supervised learning methods used in the Li et al. NeurIPS 2025 paper to appreciate the results. They argue that the results apply to all the SSL methods they tried: DINO, MAE, and CLIP.
DINOv2 was the first SSL method the authors tested and the one they focused on. DINO works by degrading the image with cropping and data augmentations. The basic idea is that the model learns to extract the important information from the degraded input and match it to the full original image. There is some complexity in that there is a teacher network, which is an exponential moving average (EMA) of the student network; this is less likely to collapse than using the student network itself to generate the training signal.
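A hedged sketch of just the EMA part described above (not the full DINO loss or the authors’ code): the teacher’s weights are a slowly moving average of the student’s, so the training target changes gradually and is harder to collapse. The momentum value is illustrative.

```python
import copy
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student.
    This is the piece of DINO that keeps the target network stable."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Usage sketch with a stand-in model (the real networks would be ViTs):
student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)   # teacher starts as a copy of the student
# ... after each optimizer step on the student:
ema_update(student, teacher)
```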
MAE is a type of Masked Image Modelling (MIM). It drops a certain percentage of the tokens or patches from the input sequence. Since the tokens include positional encoding, this is easy to do. The reduced set of tokens is passed through the encoder, and the resulting tokens are then passed through a transformer decoder that tries to “inpaint” the missing tokens. The loss signal comes from comparing the predicted tokens with the ground-truth input tokens.
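Here is a minimal sketch of the random-masking step described above (my own illustration, not the paper’s or the original MAE code); the mask ratio and tensor shapes are example values.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens, MAE-style.
    tokens: (B, N, D). Returns the kept tokens plus the indices
    of the kept and dropped positions."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # random score per token
    ids_shuffle = noise.argsort(dim=1)       # random permutation of positions
    ids_keep = ids_shuffle[:, :n_keep]
    ids_masked = ids_shuffle[:, n_keep:]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_keep, ids_masked

tokens = torch.randn(2, 256, 768)            # a 16x16 grid of patch tokens
kept, ids_keep, ids_masked = random_mask(tokens)
print(kept.shape)                            # torch.Size([2, 64, 768])
# The encoder sees only `kept`; a decoder then predicts the dropped patches,
# and the loss compares those predictions with the original patches.
```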
CLIP relies on captioned images, such as those scraped from the web. It aligns a text encoder and an image encoder, training them simultaneously. I won’t spend a lot of time describing it here, but one thing to point out is that the training signal is coarse (based on the whole image and the whole caption). The training data is web-scale, rather than limited to ImageNet, and while the signal is coarse, the feature vectors are not sparse (e.g., one-hot encoded). So, while CLIP is considered self-supervised, it does use a weakly supervised signal in the form of the captions.
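For completeness, here is a sketch of the kind of symmetric contrastive objective CLIP uses: one feature vector per whole image and one per whole caption, which is exactly why the training signal is coarse. The shapes and temperature are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_feats, txt_feats, temperature=0.07):
    """Sketch of a CLIP-style symmetric contrastive loss over a batch of
    (image, caption) pairs. Matching pairs sit on the diagonal."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.shape[0])          # each image matches its own caption
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```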
2.3. Probes

As shown in Figure 2, a probe or test that is able to discriminate object binding needs to determine that the blue patches are from the same puppy and that the red and blue patches are from different puppies. So you might create a test like cosine similarity between the patches and find that it does pretty well on your test set. But… is it really detecting object binding, and not low-level or class-based features? Most of the images probably aren’t as complex as this example. So you need a probe that is like the cosine-similarity test, but also some kind of strong baseline that can, for example, tell whether the patches belong to the same semantic class, but not necessarily whether they belong to the same instance. (A naive version of such a test is sketched below.)
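For concreteness, the naive test I’m describing would look something like this (a sketch, not from the paper); the threshold is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

def naive_same_object(token_i, token_j, threshold=0.5):
    """Naive 'binding' test: call two patches the same object if their
    tokens point in a similar direction. This can easily be fooled by
    shared low-level features or a shared semantic class."""
    sim = F.cosine_similarity(token_i, token_j, dim=-1)
    return sim > threshold

tok_a, tok_b = torch.randn(768), torch.randn(768)   # two patch tokens from some layer
print(naive_same_object(tok_a, tok_b))
```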
The probes they use that are most similar to cosine similarity are the diagonal quadratic probe and the quadratic probe, where the latter essentially adds another linear layer (kind of like a linear probe, but you have two linear probes whose outputs you take the dot product of). These are the two probes that I would consider to have the potential to detect binding. They also have some object-class-based probes that I would consider the strong baselines.

In their Figure 2 (my Figure 3), I would pay attention to the magenta quadratic-probe curve and the overlapping orange object-class curve. The quadratic curve doesn’t rise above the object-class curves until around layers 10-11 of the 23 layers. The diagonal quadratic curve never rises above those curves (see the original figure in the paper), meaning that the binding information at least needs a linear layer to project it into an “IsSameObject” subspace.
I go into a little more detail with the probes in the appendix section, which I recommend skipping until/unless you read the paper.
3. The Central Claim: Li et al. (2025)
The main claim of their paper is that ViT models trained with self-supervised learning (SSL) naturally learn object binding, while ViT models trained with ImageNet supervised classification exhibit much weaker object binding. Overall, I find their arguments convincing, although, like with all papers, there are areas where they could have improved.
Their arguments are weakened by using the weak baseline of always guessing that two patches are not bound, as shown in Figure 2. Fortunately, they used a wide range of probes that includes stronger class-based baselines, and their quadratic probe still performs better than these. I do believe it would be possible to create a better test and/or baselines, like adding positional awareness to the class-based methods. However, this is nitpicking, and the object-based probes do make a pretty good baseline. Their Figure 4 gives additional reassurance that the probe is detecting object binding, although the distance between the probed patches could still be playing a role.
Their supervised ViT model achieved only 3.7% higher accuracy than the weak baseline, which I would interpret as not having any object binding. There is one complication to this result: models trained with DINOv2 (and MAE) enforce a correspondence between the input tokens and output tokens, while ImageNet classification only trains on the first token, which corresponds to the learned “classify” task token; the remaining output tokens are ignored by the supervised training loss. So the probe assumes that the \(i\)-th token at a given layer corresponds to the \(i\)-th token of the input sequence, which is likely to hold better for the DINOv2-trained models than for the ImageNet-trained classification model.
I think it is an open question whether CLIP and MAE would have shown object binding if they were compared to a stronger baseline. Figure 7 in their Appendix doesn’t make CLIP’s binding signal look that strong, although CLIP, like supervised classification training, doesn’t enforce the token correspondence throughout processing. Notably, in both supervised learning and CLIP, the layer with the peak accuracy on same-object prediction is earlier in the network (relative depths of 0.13 and 0.39 out of 1), while networks that preserve the token correspondence peak later (0.65-1 out of 1).
Going back to mushy biological brains, one of the reasons binding is an issue is that the representation of an object is distributed across the visual hierarchy. The ViT architecture is fundamentally different in that there is no bidirectional flow of information; all the information flows in a single direction, and the representation at lower levels is no longer needed once its information has been passed on. Appendix A3 does show that the quadratic probe has relatively high accuracy for estimating whether patches from layers 15 and 18 are bound, so it seems that this information is at least present, even though the architecture is not bidirectional or recurrent.
4. Conclusion: A New Baseline for “Understanding”?
I think this paper is really quite cool, as it’s the first paper I’m aware of that shows evidence of a deep learning model exhibiting the emergent property of object binding. It would be great if the results of the other SSL methods, like MAE, could be shown against the stronger baselines, but this paper at least shows strong evidence that ViTs trained with DINO exhibit object binding. Previous work has suggested that this was not the case. The weakness (or absence) of the object-binding signal from ViTs trained on ImageNet classification is also interesting, and it is consistent with the papers suggesting that CNNs trained with ImageNet classification are biased towards texture instead of object shape [4], although ViTs have less texture bias [6] and DINO self-supervision also reduces the texture bias (but possibly not MAE) [7].
There are always things that can be improved in papers, and that’s why science builds on, expands, and tests previous findings. Discriminating object binding from other features is difficult and might require tests with artificial geometric stimuli to establish beyond doubt that object binding was found. However, the evidence presented is still quite strong.
Even if you are not interested in object binding per se, the difference in behavior between ViTs trained with self-supervised and supervised approaches is rather stark and gives us some insight into the training regimes. It suggests that the foundation models we are building are learning in a way that is more similar to the gold standard of real intelligence: humans.
Appendix
Probe Details
I’m adding this section as an appendix because it might be useful if you are going into the paper in more detail; however, I suspect it will be too much detail for most people reading this post. One approach to determine whether two tokens are bound might be to calculate the cosine similarity of those tokens, which is simply the dot product of the L2-normalized token vectors. Unfortunately, in my opinion, they didn’t try L2-normalizing the token vectors, but they did try a weighted dot product, which they call the diagonal quadratic probe:
$$\phi_\text{diag} (x,y) = x ^ \top\mathrm{diag} (w) y$$
The weights \( w \) are learned, so the probe can learn to focus on the dimensions most relevant to binding. While they didn’t perform L2-normalization, they did apply layer normalization to the tokens, which standardizes each token to zero mean and unit variance (followed by a learned affine transform).
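As I read the formula above, a sketch of the diagonal quadratic probe might look like the following; the layer-norm placement, dimensions, and class name are my assumptions, not the authors’ code, and in practice the score would be trained as a logit for an “IsSameObject” classifier.

```python
import torch
import torch.nn as nn

class DiagQuadraticProbe(nn.Module):
    """phi_diag(x, y) = x^T diag(w) y, applied after layer normalization.
    The learned weights w let the probe emphasize dimensions relevant to binding."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.w = nn.Parameter(torch.ones(dim))

    def forward(self, x, y):
        x, y = self.norm(x), self.norm(y)
        return (x * self.w * y).sum(dim=-1)   # one "IsSameObject" score per token pair

probe = DiagQuadraticProbe()
score = probe(torch.randn(4, 768), torch.randn(4, 768))   # (4,) scores
```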
There is no reason to believe that the object binding property would be nicely segregated in the feature vectors in their current forms, so it would make sense to first project them into a new “IsSameObject” subspace and then take their dot product. This is the quadratic probe that they found works so well:
$$\begin{align}
\phi_\text{quad}(x, y) &= W x \cdot W y \\
&= \left( W x \right)^\top W y \\
&= x^\top W^\top W y
\end{align}$$
where \(W \in \mathbb R ^{k \times d}, k \ll d\).
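And a corresponding sketch of the quadratic probe: project both tokens with the same low-rank matrix \(W\) and take the dot product in that \(k\)-dimensional subspace. Again, this is my reading of the formula rather than the authors’ implementation; the rank and layer norm are illustrative choices.

```python
import torch
import torch.nn as nn

class QuadraticProbe(nn.Module):
    """phi_quad(x, y) = (W x) . (W y) = x^T W^T W y, with W in R^{k x d}, k << d."""
    def __init__(self, dim=768, rank=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.W = nn.Linear(dim, rank, bias=False)   # shared projection into the subspace

    def forward(self, x, y):
        zx, zy = self.W(self.norm(x)), self.W(self.norm(y))
        return (zx * zy).sum(dim=-1)                # dot product in the k-dim subspace

probe = QuadraticProbe()
score = probe(torch.randn(4, 768), torch.randn(4, 768))    # (4,) "IsSameObject" logits
```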
The quadratic probe is much better at extracting binding than the diagonal quadratic probe. In fact, I would argue that the quadratic probe is the only probe they show that can extract the information on whether two patches belong to the same object, since it is the only one that exceeds the strong baseline of the object-class-based probes.
I skipped over their linear probe, which is a probe I feel they had to include in the paper but that doesn’t really make sense. For this probe, they apply a separately trained linear layer to each of the two tokens and then add the results. The addition is why I think this probe is a distraction: to compare the tokens, there needs to be a multiplicative interaction. The quadratic probe is the better equivalent of a linear probe when you are comparing two feature vectors.
Bibliography
[1] Y. Li, S. Salehi, L. Ungar and K. P. Kording, Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? (2025), arXiv preprint arXiv:2510.24709
[2] P. R. Roelfsema, Solving the binding problem: Assemblies form when neurons enhance their firing rate—they don’t need to oscillate or synchronize (2023), Neuron, 111(7), 1003-1019
[3] J. R. Williford and R. von der Heydt, Border-ownership coding (2013), Scholarpedia journal, 8(10), 30040
[4] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann and W. Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2018), International Conference on Learning Representations
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16×16 words: Transformers for image recognition at scale (2020), arXiv preprint arXiv:2010.11929
[6] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan and M. H. Yang, Intriguing properties of vision transformers (2021), Advances in Neural Information Processing Systems, 34, 23296-23308
[7] N. Park, W. Kim, B. Heo, T. Kim and S. Yun, What do self-supervised vision transformers learn? (2023), arXiv preprint arXiv:2305.00729