After building the Neural Network Regressor, we now move to the classifier version.
From a mathematical point of view, the two models are very similar. In fact, they differ mainly by the interpretation of the output and the choice of the loss function.
However, this classifier version is where intuition usually becomes much stronger.
In practice, neural networks are used far more often for classification than for regression. Thinking in terms of probabilities, decision boundaries, and classes makes the role of neurons and layers easier to grasp.
In this article, you will see:
- how to define the structure of a neural network in an intuitive way,
- why the number of neurons matters,
- and why a single hidden layer is already sufficient, at least in theory.
At this point, a natural question arises:
If one hidden layer is enough, why do we talk so much about deep learning?
The answer is important.
Deep learning is not just about stacking many hidden layers on top of each other. Depth helps, but it is not the whole story. What really matters is how representations are built, reused, and constrained, and why deeper architectures are more efficient to train and generalize in practice.
We will come back to this distinction later. For now, we deliberately keep the network small, so that every computation can be understood, written, and checked by hand.
This is the best way to truly understand how a neural network classifier works.
As with the neural network regressor we built yesterday, we will split the work into two parts.
First, we look at forward propagation and define the neural network as a fixed mathematical function that maps inputs to predicted probabilities.
Then, we move to backpropagation, where we train this function by minimizing the log loss using gradient descent.
The principles are exactly the same as before. Only the interpretation of the output and the loss function change.
1. Forward propagation
In this section, we focus on only one thing: the model itself. No training yet. Just the function.
1.1 A simple dataset and the intuition of building a function
We start with a very small dataset:
- 12 observations
- One single feature x
- A binary target y
The dataset is intentionally simple so that every computation can be followed manually. However, it has one important property: the classes are not linearly separable.
This means that a simple logistic regression cannot solve the problem, regardless of how well it is trained.

The right intuition, however, is almost the opposite of what this limitation suggests: rather than abandoning logistic regression, we will combine several of them.
What we are going to do is build two logistic regressions first. Each one creates a cut in the input space, as illustrated below.
In other words, we start with one single feature, and we transform it into two new features.

Then, we apply another logistic regression, this time on these two features, to obtain the final output probability.
When written as a single mathematical expression, the resulting function is already a bit complex to read. This is exactly why we use a diagram: not because the diagram is more accurate, but because it is easier to understand how the function is built by composition.
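To make this concrete, here is one way of writing that single expression, using the coefficient names that appear later in the spreadsheet (a11, b11 and a12, b12 for the two hidden neurons, a21, a22 and b2 for the output neuron); the notation in the figures may differ slightly:

$$\hat{y} = \sigma\!\big(a_{21}\,\sigma(a_{11}x + b_{11}) + a_{22}\,\sigma(a_{12}x + b_{12}) + b_{2}\big),
\qquad \sigma(z) = \frac{1}{1+e^{-z}}$$

Read from the inside out: the two inner sigmoids are the two cuts, and the outer sigmoid is the final logistic regression that combines them.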

1.2 Neural Network Structure
So the diagram represents the following model:
- One hidden layer with two neurons, which allows us to represent the two cuts we observe in the dataset
- One output neuron, which is itself a logistic regression applied to the two hidden activations

In our case, the model depends on seven coefficients:
- A weight and a bias for each of the two hidden neurons (4 parameters)
- Two weights and a bias for the output neuron (3 parameters)
Taken together, these seven numbers fully define the model.
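Written layer by layer, with the same naming convention as above, the model reads as follows, which makes the count of 2 + 2 + 3 = 7 coefficients explicit:

$$A_1 = \sigma(a_{11}x + b_{11}), \qquad A_2 = \sigma(a_{12}x + b_{12}), \qquad \hat{y} = \sigma(a_{21}A_1 + a_{22}A_2 + b_{2})$$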
Now, if you already understand how a neural network classifier works, here is a question for you:
How many different solutions can this model have?
In other words, how many distinct sets of seven coefficients can produce the same classification boundary, or almost the same predicted probabilities, on this dataset?
1.3 Implementing forward propagation in Excel
We now implement the model using Excel formulas.
To visualize the output of the neural network, we generate new values of x ranging from −2 to 2 with a step of 0.02.
For each value of x, we compute:
- The outputs of the two hidden neurons (A1 and A2)
- The final output of the network
At this stage, the model is not trained yet. We therefore need to fix the seven parameters of the network. For now, we simply use a set of reasonable values, shown below, which allows us to visualize the forward propagation of the model.
It is just one possible configuration of the parameters. Even before training, this already raises an interesting question: how many different parameter configurations could produce a valid solution for this problem?

We can use the following equations to compute the values of the hidden layers and the output.

The intermediate values A1 and A2 are displayed explicitly. This avoids large, unreadable formulas and makes the forward propagation easy to follow.
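For readers who prefer code to spreadsheets, here is a minimal Python sketch of the same forward pass on the grid of x values; the seven coefficient values are hypothetical placeholders chosen to produce two cuts, not the values shown in the screenshot:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Seven fixed coefficients (hypothetical placeholder values, not the ones in the sheet)
a11, b11 = 10.0, 10.0              # hidden neuron 1: cut around x = -1
a12, b12 = 10.0, -10.0             # hidden neuron 2: cut around x = +1
a21, a22, b2 = 10.0, -10.0, -5.0   # output logistic regression on (A1, A2)

# New values of x from -2 to 2 with a step of 0.02 (201 points)
x_grid = np.linspace(-2, 2, 201)

A1 = sigmoid(a11 * x_grid + b11)           # output of hidden neuron 1
A2 = sigmoid(a12 * x_grid + b12)           # output of hidden neuron 2
y_hat = sigmoid(a21 * A1 + a22 * A2 + b2)  # final predicted probability
```

With these placeholder values, the predicted probability is high roughly between -1 and 1 and low outside, which is one example of the kind of shape the two cuts can produce.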

The dataset has been successfully divided into two distinct classes using the neural network.

1.4 Forward propagation: summary and observations
To recap, we started with a simple training dataset and defined a neural network as an explicit mathematical function, implemented using straightforward Excel formulas and a fixed set of coefficients. By feeding new values of x into this function, we were able to visualize the output of the neural network and observe how it separates the data.

Now, if you look closely at the shapes produced by the hidden layer, which contains the two logistic regressions, you can see that there are four possible configurations. They correspond to the different possible orientations of the slopes of the two logistic functions.
Each hidden neuron can have either a positive or a negative slope. With two neurons, this leads to 2×2=4 possible combinations. These different configurations can produce very similar decision boundaries at the output, even though the underlying parameters are different.
This explains why the model can admit multiple solutions for the same classification problem.

The more challenging part is now to determine the values of these coefficients.
This is where backpropagation comes into play.
2. Backpropagation: training the neural network with gradient descent
Once the model is defined, training becomes a numerical problem.
Despite its name, backpropagation is not a separate algorithm. It is simply gradient descent applied to a composed function.
2.1 Reminder of the backpropagation algorithm
The principle is the same for all weight-based models.
We first define the model, that is, the mathematical function that maps the input to the output.
Then we define the loss function. Since this is a binary classification task, we use log loss, exactly as in logistic regression.
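For one observation $(x_i, y_i)$ with predicted probability $\hat{y}_i$, the log loss is the usual binary cross-entropy; the total cost combines the 12 individual losses (written here as an average, a plain sum would only rescale the gradients):

$$L_i = -\big[\,y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\,\big],
\qquad J = \frac{1}{12}\sum_{i=1}^{12} L_i$$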
Finally, in order to learn the coefficients, we compute the partial derivatives of the loss with respect to each coefficient of the model. These derivatives are what allow us to update the parameters using gradient descent.
Below is a screenshot showing the final formulas for these partial derivatives.

The backpropagation algorithm can then be summarized as follows (a compact Python sketch of the same loop is given right after the list):
- Initialize the weights of the neural network randomly.
- Feedforward the inputs through the neural network to get the predicted output.
- Calculate the error between the predicted output and the actual output.
- Backpropagate the error through the network to calculate the gradient of the loss function with respect to the weights.
- Update the weights using the calculated gradient and a learning rate.
- Repeat steps 2 to 5 until the model converges.
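As a reference, here is a compact Python sketch of this loop for the exact architecture above; the dataset and the hyperparameters (learning rate, number of iterations) are placeholders, not the ones used in the Excel file:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D dataset: 12 points, not linearly separable
x = np.linspace(-2, 2, 12)
y = ((x > -1) & (x < 1)).astype(float)  # class 1 in the middle, class 0 outside

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Initialize the 7 coefficients randomly
a11, b11, a12, b12, a21, a22, b2 = rng.normal(0, 1, 7)

lr = 0.5  # learning rate (placeholder value)
for it in range(2000):
    # 2. Forward propagation
    A1 = sigmoid(a11 * x + b11)
    A2 = sigmoid(a12 * x + b12)
    y_hat = sigmoid(a21 * A1 + a22 * A2 + b2)

    # 3. Log loss (mean over the 12 observations)
    eps = 1e-12
    loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

    # 4. Backpropagation: gradients of the loss w.r.t. each coefficient
    d_out = (y_hat - y) / len(x)  # error term through sigmoid output + log loss
    g_a21, g_a22, g_b2 = np.sum(d_out * A1), np.sum(d_out * A2), np.sum(d_out)
    d_A1 = d_out * a21 * A1 * (1 - A1)
    d_A2 = d_out * a22 * A2 * (1 - A2)
    g_a11, g_b11 = np.sum(d_A1 * x), np.sum(d_A1)
    g_a12, g_b12 = np.sum(d_A2 * x), np.sum(d_A2)

    # 5. Gradient descent update
    a11 -= lr * g_a11; b11 -= lr * g_b11
    a12 -= lr * g_a12; b12 -= lr * g_b12
    a21 -= lr * g_a21; a22 -= lr * g_a22; b2 -= lr * g_b2

print(f"final loss: {loss:.4f}")
```

Each iteration of this loop corresponds to one row of the Excel sheet: forward propagation, loss, gradients, and the coefficient updates.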
2.2 Initialization of the coefficients
The dataset is organized in columns to make Excel formulas easy to extend.

The coefficients are initialized with specific values here. You can change them, but convergence is not guaranteed. Depending on the initialization, the gradient descent may converge to a different solution, converge very slowly, or fail to converge altogether.

2.3 Forward propagation
In the columns from AG to BP, we implement the forward propagation step. We first compute the two hidden activations A1 and A2, and then the output of the network. These are exactly the same formulas as those used earlier to define the forward propagation of the model.
To keep the computations readable, we process each observation separately. As a result, we have 12 columns for the hidden layer outputs (A1 and A2) and 12 columns for the output layer.
Instead of writing a single summation formula, we compute the values observation by observation. This avoids very large and hard-to-read formulas, and it makes the logic of the computations much clearer.
This column-wise organization also makes it easy to mimic a for-loop during gradient descent: the formulas can simply be extended by row to represent successive iterations.

2.4 Errors and the cost function
In the columns from BQ to CN, we compute the error terms and the values of the cost function.
For each observation, we evaluate the log loss based on the predicted output and the true label. These individual losses are then combined to obtain the total cost for each iteration.

2.5 Partial derivatives
We now move to the computation of the partial derivatives.
The neural network has 7 coefficients, so we need to compute 7 partial derivatives, one for each parameter. For each derivative, the computation is done for all 12 observations, which leads to a total of 84 intermediate values.
To keep this manageable, the sheet is carefully organized. The columns are grouped and color-coded so that each derivative can be followed easily.
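Assuming the parameterization introduced earlier, and using the standard fact that with a sigmoid output and the log loss the error term at the output simplifies to $\hat{y}_i - y_i$, the per-observation derivatives have the following form (the remaining ones, for $a_{22}$, $a_{12}$ and $b_{12}$, are obtained by swapping $A_1, a_{21}$ for $A_2, a_{22}$); the screenshot formulas should match these up to notation:

$$\frac{\partial L_i}{\partial a_{21}} = (\hat{y}_i - y_i)\,A_{1,i},
\qquad \frac{\partial L_i}{\partial b_{2}} = \hat{y}_i - y_i,$$
$$\frac{\partial L_i}{\partial a_{11}} = (\hat{y}_i - y_i)\,a_{21}\,A_{1,i}(1 - A_{1,i})\,x_i,
\qquad \frac{\partial L_i}{\partial b_{11}} = (\hat{y}_i - y_i)\,a_{21}\,A_{1,i}(1 - A_{1,i})$$

The gradient used to update each coefficient is then the sum of these terms over the 12 observations.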
In the columns from CO to DL, we compute the partial derivatives associated with a11 and a12.

In the columns from DM to EJ, we compute the partial derivatives associated with b11 and b12.

In the columns from EK to FH, we compute the partial derivatives associated with a21 and a22.

In the columns from FI to FT, we compute the partial derivatives associated with b2.

And to wrap it up, we sum the partial derivatives across the 12 observations.
The resulting gradients are grouped and shown in the columns from Z to AF.

2.6 Updating weights in a for loop
These partial derivatives allow us to perform gradient descent for each coefficient. The updates are computed in the columns from R to X.
At each iteration, we can observe how the coefficients evolve. The value of the cost function is shown in column Y, which makes it easy to see whether the descent is working and whether the loss is decreasing.
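Concretely, each coefficient follows the standard gradient descent update, where $\alpha$ denotes the learning rate:

$$\theta \;\leftarrow\; \theta - \alpha\,\frac{\partial J}{\partial \theta},
\qquad \theta \in \{a_{11}, b_{11}, a_{12}, b_{12}, a_{21}, a_{22}, b_{2}\}$$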

After updating the coefficients at each step of the for loop, we recompute the output of the neural network.

If the initial values of the coefficients are poorly chosen, the algorithm may fail to converge or may converge to an undesired solution, even with a reasonable step size.

The GIF below shows the output of the neural network at each iteration of the for loop. It helps visualize how the model evolves during training and how the decision boundary gradually converges toward a solution.

Conclusion
We have now completed the full implementation of a neural network classifier, from forward propagation to backpropagation, using only explicit formulas.
By building everything step by step, we have seen that a neural network is nothing more than a mathematical function, trained by gradient descent. Forward propagation defines what the model computes. Backpropagation tells us how to adjust the coefficients to reduce the loss.
This file allows you to experiment freely: you can change the dataset, modify the initial values of the coefficients, and observe how the training behaves. Depending on the initialization, the model may converge quickly, converge to a different solution, or get stuck in a local minimum.
Through this exercise, the mechanics of neural networks become concrete. Once these foundations are clear, using high-level libraries feels much less opaque, because you know exactly what is happening behind the scenes.
Further Reading
Thank you for your support for my Machine Learning “Advent Calendar”.
People usually talk a lot about supervised learning, while unsupervised learning is sometimes overlooked, even though it can be very useful in many situations. The articles below explore these unsupervised approaches.
Thank you, and happy reading.
https://towardsdatascience.com/the-machine-learning-advent-calendar-day-5-gmm-in-excel/
https://towardsdatascience.com/the-machine-learning-advent-calendar-day-10-dbscan-in-excel/
https://towardsdatascience.com/the-machine-learning-advent-calendar-day-9-lof-in-excel/