
Understanding Convolutional Neural Networks (CNNs) Through Excel


Deep learning is often seen as a black box. We know that it learns from data, but the question is how it truly learns.

In this article, we will build a tiny Convolutional Neural Network (CNN) directly in Excel to understand, step by step, how a CNN actually works for images.

We will open this black box, and watch each step happen right before our eyes. We will understand all the calculations that are the foundation of what we call “deep learning”.

This article is part of a series about implementing machine learning and deep learning algorithms in Excel. You can find all the Excel files via this Ko-fi link.

1. How Images are Seen by Machines

1.1 Two Ways to Detect Something in an Image

When we try to detect an object in a picture, like a cat, there are two main approaches: the deterministic approach and the machine learning approach. Let’s see how each of them works for the example of recognizing a cat in a picture.

The deterministic way means writing rules by hand.

For example, we can say that a cat has a round face, two triangle ears, a body, a tail, etc. So the developer will do all the work to define the rules.

Then the computer runs all these rules, and gives a score of similarity.

Deterministic approach to detect a cat on a picture — image by author

The machine learning approach means that we do not write rules by ourselves.

Instead, we give the computer many examples, pictures with cats and pictures without cats. Then it learns by itself what makes a cat a cat.

Machine learning approach to detect a cat on a picture — image by author (cats are generated by AI)

That is where things may become mysterious.

We usually say that the machine will figure it out by itself, but the real question is how.

In fact, we still have to tell the machine how to create these rules, and the rules have to be learnable. So the key question is: how do we define the kind of rules that will be used?

To understand how to define rules, we first have to understand what an image is.

1.2 Understanding What an Image Is

A cat is a complex shape, so let’s take a simpler and clearer example: recognizing handwritten digits from the MNIST dataset.

First, what is an image?

A digital image can be seen as a grid of pixels. Each pixel is a number between 0 and 255 that encodes its intensity: here, 0 for the white background and 255 for the darkest ink.

In Excel, we can represent this grid with a table where each cell corresponds to one pixel.

MNIST Handwritten digits – image from the MNIST dataset https://en.wikipedia.org/wiki/MNIST_database (CC BY-SA 3.0)

The original digit images are 28×28 pixels. But to keep things simple, we will reduce the dimension and use a 10×10 table instead. It is small enough for quick calculations but still large enough to show the general shape.

For example, the handwritten number “1” can be represented by a 10×10 grid as below in Excel.

Image is a grid of numbers — image by author
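To make this concrete outside of Excel, here is the same idea as a tiny Python sketch. The pixel values below are illustrative, not taken from the actual spreadsheet or from MNIST:

```python
# A 10x10 "image" of a handwritten "1": each cell is one pixel intensity.
# 0 = white background, 255 = darkest ink (made-up values for illustration).
image = [
    [0, 0, 0,   0,   128, 255, 0,   0,   0, 0],
    [0, 0, 0,   128, 255, 255, 0,   0,   0, 0],
    [0, 0, 128, 255, 128, 255, 0,   0,   0, 0],
    [0, 0, 0,   0,   0,   255, 0,   0,   0, 0],
    [0, 0, 0,   0,   0,   255, 0,   0,   0, 0],
    [0, 0, 0,   0,   0,   255, 0,   0,   0, 0],
    [0, 0, 0,   0,   0,   255, 0,   0,   0, 0],
    [0, 0, 0,   0,   0,   255, 0,   0,   0, 0],
    [0, 0, 255, 255, 255, 255, 255, 255, 0, 0],
    [0, 0, 0,   0,   0,   0,   0,   0,   0, 0],
]

# This nested list is exactly what an Excel table of pixel values holds.
n_rows, n_cols = len(image), len(image[0])
```

Each row of the list corresponds to one row of cells in the Excel sheet.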

1.3 Before Deep Learning: Classic Machine Learning for Images

Before using CNNs or any deep learning method, we can already recognize simple images with classic machine learning algorithms such as logistic regression or decision trees.

In this approach, each pixel becomes one feature. For example, a 10×10 image has 100 pixels, so there are 100 features as input.

The algorithm then learns to associate patterns of pixel values with labels such as “0”, “1”, or “2”.

Classic ML for image recognition — image by author
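As a sketch of this “each pixel is one feature” idea (the image here is a made-up placeholder, not real MNIST data):

```python
# Classic ML treats each pixel as an independent feature:
# a 10x10 image becomes a flat vector of 100 numbers.
image = [[0] * 10 for _ in range(10)]  # placeholder 10x10 image, all white
image[4][5] = 255                      # one "ink" pixel, for illustration

# Row-major flattening: row 0 first, then row 1, and so on.
features = [pixel for row in image for pixel in row]

# A model such as logistic regression receives these 100 values
# with no notion of which pixels are neighbors in the grid.
```

Notice that once the grid is flattened, the information about which pixels were adjacent is gone, which is exactly the limitation discussed below.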

In fact, with this simple machine learning approach, logistic regression can already achieve quite good results, with an accuracy of around 90%.

This shows that classic models are able to learn useful information from raw pixel values.

However, they have a major limitation: they treat each pixel as an independent value, without considering its neighbors. As a result, they cannot capture spatial relationships between pixels.

So intuitively, we know that the performance will not be good for complex images: this method does not scale.

Now, if you already know how classic machine learning works, you know that there is no magic. And in fact, you already know what to do: improve the feature engineering step and transform the features, in order to extract more meaningful information from the pixels.

2. Building a CNN Step by Step in Excel

2.1 From complex CNNs to a simple one in Excel

When we talk about Convolutional Neural Networks, we often see very deep and complex architectures, like VGG-16. With many layers, millions of parameters, and countless operations, it seems so complex that we might say it is impossible to understand exactly how it works.

VGG16 architecture — image by author

The main idea behind the layers is: detecting patterns step by step.

With the example of handwritten digits, let’s ask a question: what could be the simplest possible CNN architecture?

First, let’s reduce the number of hidden layers. How many? Let’s use one. That’s right: only one.

As for the filters, what about their dimensions? In real CNN layers, we usually use 3×3 filters to detect small patterns. But let’s begin with big ones.

How big? 10×10!

Yes, why not?

This also means that we don’t have to slide the filter across the image: we can directly compare the input image with the filter and see how well they match.

This simple case is not about performance, but about clarity.
It will show how CNNs detect patterns step by step.

Now, we have to define the number of filters. Let’s use 10, the minimum. Why? Because there are 10 digits, so we need at least 10 filters, one per digit. We will see how they can be found in the next section.

In the image below, you have the diagram of this simplest architecture of a CNN neural network:

The simplest CNN architecture – image by author

2.2 Training the Filters (or Designing Them Ourselves)

In a real CNN, the filters are not written by hand. They are learned during training.

The neural network adjusts the values inside each filter to detect the patterns that best help to recognize the images.

In our simple Excel example, we will not train the filters.

Instead, we will create them ourselves to understand what they represent.

Since we already know the shapes of handwritten digits, we can design filters that look like each digit.

For example, we can draw a filter that matches the form of 0, another for 1, and so on.

Another option is to take the average image of all examples for each digit and use that as the filter.

Each filter will then represent the “average shape” of a number.

This is where the frontier between human and machine becomes visible again. We can either let the machine discover the filters, or we can use our own knowledge to build them manually.

That is right: machines do not define the nature of the operations; machine learning researchers define them. Machines are only good at running loops to find the optimal values for these defined rules. And in simple cases like this one, humans can define good rules directly.

So, if there are only 10 filters to define, we know that we can directly design one filter per digit. We know, intuitively, the nature of these filters. But there are other options, of course.

Now, to define the numerical values of these filters, we can directly use our knowledge. And we also can use the training dataset.

Below you can see the 10 filters created by averaging all the images of each handwritten digit. Each one shows the typical pattern that defines a number.

Average values as filters — image by author

2.3 How a CNN Detects Patterns

Now that we have the filters, we have to compare the input image to these filters.

The central operation in a CNN is called cross-correlation. It is the key mechanism that allows the computer to match patterns in an image.

It works in two simple steps:

  1. Multiply values/dot product: we take each pixel in the input image and multiply it by the pixel at the same position in the filter. The filter “looks” at each pixel of the image and measures how similar it is to the pattern stored in the filter: if both values are large, the result is large.
  2. Add results/sum: The products of these multiplications are then added together to produce a single number. This number expresses how strongly the input image matches the filter.
Example of Cross Correlation for one picture – image by author
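The two steps above can be sketched in a few lines of Python, using tiny 2×2 grids instead of the 10×10 ones:

```python
def cross_correlation_score(image, filt):
    """Multiply each pixel by the filter value at the same position, then sum."""
    return sum(
        image[i][j] * filt[i][j]
        for i in range(len(image))
        for j in range(len(image[0]))
    )

# Tiny 2x2 example: the image matches the first filter better than the second.
img = [[1, 0],
       [1, 0]]
f_a = [[1, 0],
       [1, 0]]  # same pattern as the image
f_b = [[0, 1],
       [0, 1]]  # opposite pattern

score_a = cross_correlation_score(img, f_a)  # 1*1 + 0*0 + 1*1 + 0*0 = 2
score_b = cross_correlation_score(img, f_b)  # all products are 0
```

In Excel, this whole function is a single `SUMPRODUCT` over the two 10×10 ranges.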

In our simplified architecture, the filter has the same size as the input image (10×10).

Because of this, the filter does not need to move across the image.
Instead, the cross-correlation is applied once, comparing the whole image with the filter directly.

This number represents how well the image matches the pattern inside the filter.

If the filter looks like the average shape of a handwritten “5”, a high value means that the image is probably a “5”.

By repeating this operation with all filters, one per digit, we can see which pattern gives the highest match.

2.4 Building a Simple CNN in Excel

We can now create a small CNN from end to end to see how the full process works in practice.

  1. Input: A 10×10 matrix represents the image to classify.
  2. Filters: We define ten filters of size 10×10, each one representing the average image of a handwritten digit from 0 to 9. These filters act as pattern detectors for each number.
  3. Cross correlation: Each filter is applied to the input image, producing a single score that measures how well the image matches that filter’s pattern.
  4. Decision: The filter with the highest score gives the predicted digit. In deep learning frameworks, this step is often handled by a Softmax function, which converts all scores into probabilities.
    In our simple Excel version, taking the maximum score is enough to determine which digit the image most likely represents.
Each 10×10 filter represents the average shape of a handwritten digit (0–9).
The input image is compared with all filters using cross-correlation.
The filter that produces the highest score — after normalization with Softmax — corresponds to the detected digit.
Cross-correlation of the input digit with ten average digit filters. The highest score, normalized by Softmax, identifies the input as “6.” – image by author
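The whole pipeline can be sketched in Python. The 3×3 image and the two filters below are made up for illustration, standing in for the 10×10 image and the ten average-digit filters:

```python
import math

def score(image, filt):
    """Cross-correlation: element-wise product, then sum."""
    return sum(p * f for row_i, row_f in zip(image, filt)
                     for p, f in zip(row_i, row_f))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 3x3 example with two filters standing in for the ten 10x10 averages.
image = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]                        # a vertical stroke, like a "1"
filters = {"1": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],   # vertical-stroke pattern
           "0": [[1, 1, 1], [1, 0, 1], [1, 1, 1]]}   # ring pattern

scores = {digit: score(image, f) for digit, f in filters.items()}
probs = softmax(list(scores.values()))
prediction = max(scores, key=scores.get)   # the filter with the highest score wins
```

Taking the maximum score is the decision rule; the softmax only rescales the scores into probabilities without changing which digit wins.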

2.5 Convolution or Cross Correlation?

At this point, you might wonder why we call it a Convolutional Neural Network when the operation we described is actually cross-correlation.

The difference is subtle but simple:

  • Convolution means flipping the filter both horizontally and vertically before sliding it over the image.
  • Cross-correlation means applying the filter directly, without flipping.
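This difference is easy to verify in code (a minimal sketch with a tiny 2×2 filter):

```python
def flip(filt):
    """Flip a filter horizontally and vertically (a 180-degree rotation)."""
    return [row[::-1] for row in filt[::-1]]

def cross_correlate(image, filt):
    """Apply the filter directly: element-wise product, then sum."""
    return sum(p * f for ri, rf in zip(image, filt) for p, f in zip(ri, rf))

def convolve(image, filt):
    """Convolution = cross-correlation with the flipped filter."""
    return cross_correlate(image, flip(filt))

img = [[1, 2],
       [3, 4]]
flt = [[1, 0],
       [0, 0]]  # picks out a single corner

cc = cross_correlate(img, flt)  # keeps the top-left pixel: 1
cv = convolve(img, flt)         # the flipped filter keeps the bottom-right pixel: 4
```

With a symmetric filter the two results coincide, which is one reason the distinction rarely matters in practice.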

For more information, you can read this article:

For historical reasons, the term convolution stuck, whereas the operation that is actually done in a CNN is cross-correlation.

In fact, most deep learning frameworks, such as PyTorch or TensorFlow, actually use cross-correlation when performing “convolutions”.

Cross correlation and convolution — image by author

In short:

CNNs are “convolutional” in name, but “cross-correlational” in practice.

3. Building More Complex Architectures

3.1 Small filters to detect more detailed patterns

In the previous example, we used a single 10×10 filter to compare the whole image with one pattern.

This was enough to understand the principle of cross-correlation and how a CNN detects similarity between an image and a filter.

Now we can take one step further.

Instead of one global filter, we will use several smaller filters, each of size 5×5. These filters will look at smaller regions of the image, detecting local details instead of the entire shape.

Let’s take an example with four 5×5 filters applied to a handwritten digit.

The input image can be cut into 4 smaller parts of 5×5 pixels for each one.

We still can use the average value of all the digits to begin with. So each filter will give 4 values, instead of one.

Smaller filters in CNN for digits recognition – image by author

At the end, we can apply a Softmax function to get the final prediction.

But in this simple case, it is also possible just to sum all the values.

3.2 What if the digit is not in the center of the image

In the previous examples, I compared the filters to fixed areas of the image. An intuitive question we can ask is: what if the object is not centered? Indeed, it can be at any position in the image.

The solution is actually very simple: you slide the filter across the image.

Let’s take a simple example again: the dimension of the input image is now 10×14. The height is unchanged, but the width is 14.

The filter is still 10×10, and it slides horizontally across the image. Since 14 − 10 + 1 = 5, we get 5 cross-correlation scores.

We do not know where the digit is, but it is not a problem, because we can just take the max value of the 5 cross-correlations.

This is what we call a max pooling layer.

Max pooling in a simple CNN – Image by author
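The slide-then-take-the-max step can be sketched as follows; the 1×5 “image” and 1×3 filter are made-up stand-ins for the 10×14 image and 10×10 filter:

```python
def score(patch, filt):
    """Cross-correlation of one patch with the filter."""
    return sum(p * f for rp, rf in zip(patch, filt) for p, f in zip(rp, rf))

def slide_and_pool(image, filt):
    """Slide the filter horizontally, score every position, keep the max."""
    h = len(filt[0])           # filter width
    w = len(image[0])          # image width
    positions = w - h + 1      # e.g. a 10x10 filter on a 10x14 image: 5 positions
    scores = [
        score([row[x:x + h] for row in image], filt)
        for x in range(positions)
    ]
    return max(scores), scores

# Tiny stand-in: the pattern sits off-center, to the right.
image = [[0, 0, 1, 1, 1]]
filt = [[1, 1, 1]]

best, all_scores = slide_and_pool(image, filt)
```

The max is highest at the position where the filter lines up with the pattern, so the detection works wherever the digit sits.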

3.3 Other Operations Used in CNNs

In this article, we tried to explain why each component of a CNN is useful.

The most important component is the cross-correlation between the input and the filters. We also explained why small filters can be useful, and how max pooling handles objects that can appear anywhere in an image.

There are also other steps commonly used in CNNs, such as using several layers in a row or applying non-linear activation functions.

These steps make the model more flexible, more robust, and able to learn richer patterns.

Why are they useful exactly?

I will leave this question to you as an exercise.

Now that you understand the core idea, try to think about how each of these steps helps a CNN go further, and you can try to think about some concrete example in Excel.

Conclusion

Simulating a CNN in Excel is a fun and practical way to see how machines recognize images.

By working with small matrices and simple filters, we can understand the main steps of a CNN.

I hope this article gave you some food for thought about what deep learning really is. The difference between machine learning and deep learning is not only about how deep the model is, but about how it works with representations of images and data.
