The Machine Learning and Deep Learning “Advent Calendar” Series: The Blueprint

It is very easy to train any model, and training always goes through what looks like the same method, fit. So we get used to the idea that training any model is similar and simple.

With AutoML, grid search, and generative AI, “training” machine learning models can be done with a simple “prompt”.

But in reality, when we call model.fit, the process behind each model can be very different, and each model works with the data in its own way.
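To make this concrete, here is a minimal sketch in plain Python. The two toy classes and their names are my own illustration, not taken from any library. Both expose the same fit and predict methods, yet one "trains" by storing the data while the other computes two coefficients.

```python
class NearestNeighborRegressor:
    """Hypothetical 1-nearest-neighbor model: fit only stores the data."""

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
        return self

    def predict(self, x):
        # Find the closest stored point and return its target value.
        i = min(range(len(self.xs)), key=lambda j: abs(self.xs[j] - x))
        return self.ys[i]


class SimpleLinearRegressor:
    """Hypothetical model y = a*x + b: fit computes a and b in closed form."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sxx = sum((x - mean_x) ** 2 for x in xs)
        self.a = sxy / sxx
        self.b = mean_y - self.a * mean_x
        return self

    def predict(self, x):
        return self.a * x + self.b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same call for both models, very different work behind it.
knn = NearestNeighborRegressor().fit(xs, ys)
lin = SimpleLinearRegressor().fit(xs, ys)

print(knn.predict(10.0))  # repeats the target of the nearest training point
print(lin.predict(10.0))  # extrapolates along the fitted line
```

Far from the training data, the two models disagree strongly, even though the code that trained them looks identical from the outside.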

We can observe two very different trends, almost in two opposite directions:

On the one hand, we train, use, manipulate, and predict with increasingly complex models (such as generative models).
On the other hand, we are not always able to explain simple models (such as linear regression or the linear discriminant classifier), or to recalculate their results by hand.

It is important to understand the models we use, and the best way to understand them is to implement them ourselves. Some people do it with Python, R, or other programming languages. But that remains a barrier for those who do not program, and nowadays understanding AI is essential for everyone. Moreover, a programming language can hide some operations behind existing functions. Nothing is explained visually: the function is coded and then run, and only the results are shown, not each intermediate operation.

So the best tool for exploring models, in my opinion, is Excel, with formulas that clearly show every step of the calculations.

In fact, when we receive a dataset, most non-programmers will open it in Excel to understand what is inside. This is very common in the business world.

Even many data scientists, myself included, use Excel to take a quick look. And when it is time to explain the results, showing them directly in Excel is often the most effective way, especially in front of executives.

In Excel, everything is visible. There is no “black box”. You can see every formula, every number, every calculation.

This helps a lot to understand how the models really work, without shortcuts.

Also, you do not need to install anything. Just a spreadsheet.

I will publish a series of articles about how to understand and implement machine learning and deep learning models in Excel.

For the “Advent Calendar”, I will publish one article per day.

Generated by Gemini: “Advent Calendar” of AI

Who is this series for?

For students, I think these articles offer a practical point of view that helps make sense of complex formulas.

For ML or AI developers who have not always studied the theory: without complicated algebra, probability, or statistics, you can open the black box behind model.fit. For every model you call model.fit, but in reality the models can be very different.

This is also for managers who may not have the full technical background, but to whom Excel will give the intuitive ideas behind the models. Combined with your business expertise, this lets you better judge whether machine learning is really necessary, and which model might be more suitable.

So, in summary, the goal is to better understand the models, the training of the models, the interpretability of the models, and the links between different models.

Structure of the articles

From a practitioner’s point of view, we usually group the models into two categories: supervised learning and unsupervised learning.

Then for supervised learning, we have regression and classification. And for unsupervised learning, we have clustering and dimensionality reduction.

Overview of machine learning models from a practitioner’s point of view – image by author

But you have surely already noticed that some algorithms share the same or a similar approach, such as KNN classifier vs. KNN regressor, decision tree classifier vs. decision tree regressor, linear regression vs. “linear classifier”.

A regression tree and linear regression have the same objective, that is, to do a regression task. But when you try to implement them in Excel, you will see that the regression tree is very close to the classification tree. And linear regression is closer to a neural network.
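As a rough illustration of why the two kinds of trees are so close, here is a minimal Python sketch of a single split search. The function names and the toy data are assumptions of mine. The search itself is identical in both cases; only the impurity measure changes.

```python
def squared_error_impurity(ys):
    """Sum of squared deviations from the mean (used for regression)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def gini_impurity(ys):
    """Gini impurity weighted by group size (used for classification)."""
    if not ys:
        return 0.0
    n = len(ys)
    return n * (1.0 - sum((ys.count(c) / n) ** 2 for c in set(ys)))


def best_split(xs, ys, impurity):
    """Try each midpoint between sorted distinct x values and keep the
    threshold with the lowest total impurity.
    Assumes xs contains at least two distinct values."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = impurity(left) + impurity(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best[0]


xs = [1, 2, 3, 10, 11, 12]

# Same search, two different impurity measures.
reg_threshold = best_split(xs, [1.0, 1.2, 0.9, 5.0, 5.1, 4.9], squared_error_impurity)
clf_threshold = best_split(xs, ["a", "a", "a", "b", "b", "b"], gini_impurity)
print(reg_threshold, clf_threshold)
```

On this toy data both calls pick the same threshold, because the numeric targets and the class labels change at the same place along x.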

And sometimes people confuse K-NN with K-means. Some may argue that their goals are completely different, and that confusing them is a beginner’s mistake. But we also have to admit that they share the same approach of calculating distances between data points. So there is a relationship between them.

The same goes for isolation forest and random forest: both contain a “forest”.
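The shared distance calculation can be sketched in a few lines of Python. This is a simplified illustration with made-up data and helper names of my own, and the K-means part shows only one assignment step, not the full algorithm.

```python
import math


def euclidean(p, q):
    """Distance between two points given as tuples of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, new_point, k=3):
    """Supervised: majority vote among the k closest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], new_point))
    nearest = [train_labels[i] for i in order[:k]]
    return max(set(nearest), key=nearest.count)


def kmeans_assign(points, centroids):
    """Unsupervised: one K-means assignment step, where each point
    joins the cluster of its closest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: euclidean(p, centroids[c]))
            for p in points]


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

# Both functions rely on the same euclidean() helper.
knn_label = knn_predict(points, labels, (2.0, 2.0), k=3)
clusters = kmeans_assign(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(knn_label, clusters)
```

The difference is in what the distances are measured against: labeled training points for K-NN, and centroids that the algorithm itself keeps updating for K-means.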

So I will organize all the models from a theoretical point of view. There are three main approaches, and we will clearly see how these approaches are implemented in a very different way in Excel.

This overview will help us to navigate through all the different models, and connect the dots between many of them.

Overview of machine learning models organised by theoretical approaches – image by author

For distance-based models, we will calculate local or global distances between a new observation and the training dataset.
For tree-based models, we have to define the splits or rules that will be used to make categories of the features.
For math functions, the idea is to apply weights to features. To train the model, gradient descent is mainly used.
For deep learning models, we will see that the main point is feature engineering, to create an adequate representation of the data.
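For the "math functions" family in the list above, here is a minimal gradient descent sketch in Python for a model with one weight and one intercept. The learning rate, the number of steps, and the toy data are arbitrary assumptions chosen so that the loop converges on this example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a*x + b by minimizing the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors with the current weights.
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_a = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move the weights a small step against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy data generated from y = 2x + 1.
a, b = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(a, 4), round(b, 4))
```

Each pass of the loop is the kind of calculation that can be laid out row by row in a spreadsheet: predictions, errors, gradients, and updated weights.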

For each model, we will try to answer these questions.

General questions about the model:

What is the nature of the model?
How is the model trained?
What are the hyperparameters of the model?
How can the same model approach be used for regression, classification, or even clustering?

How features are modelled:

How are categorical features handled?
How are missing values managed?
For continuous features, does scaling make a difference?
How do we measure the importance of one feature?

How can we qualify the importance of the features? This question will also be discussed. You may know that packages like LIME and SHAP are very popular, and that they are model-agnostic. But the truth is that each model behaves quite differently, so it is also interesting, and important, to interpret a model directly from its own structure.

Relationships between different models

Each model will have its own article, but we will also discuss its relationships with other models. Since we truly open each “black box”, we will also see how to make theoretical improvements to some models.

KNN and LDA (Linear Discriminant Analysis) are very close. The first uses a local distance, and the latter uses a global distance.
Gradient boosting is the same as gradient descent, only the vector space is different.
Linear regression is also a classifier.
Label encoding can, in a way, be used for categorical features, and it can be very powerful, but you have to choose the “labels” wisely.
SVM is very close to linear regression, even closer to ridge regression.
LASSO and SVM use one similar principle to select features or data points. Do you know that the second S in LASSO is for selection?
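To illustrate just one item from this list, "linear regression is also a classifier", here is a minimal sketch under my own assumptions: the classes are encoded as 0 and 1, a least-squares line is fitted to these labels, and the prediction is cut at 0.5. The data and the threshold are illustrative choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]  # the two classes encoded as 0 and 1

a, b = fit_line(xs, labels)


def classify(x):
    """Predict class 1 when the fitted line reaches 0.5 or more."""
    return 1 if a * x + b >= 0.5 else 0


print([classify(x) for x in xs])
print(classify(2.5), classify(6.5))
```

On this well-separated toy data, the line fitted to the 0/1 labels recovers the training classes.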

For each model, we will also discuss one particular point that most traditional courses miss. I call it the untaught lesson of the machine learning model.

Model training vs hyperparameter tuning

In these articles, we will focus only on how the models work and how they are trained. We will not discuss hyperparameter tuning, because the process is essentially the same for every model. We typically use grid search.
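For reference, the generic process can be sketched in a few lines of Python: try each candidate value, score it on held-out data, and keep the best. The nearest-neighbor regressor, the candidate grid, and the data below are illustrative assumptions of mine.

```python
def knn_regress(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, valid, k):
    """Mean squared error on the validation set for a given k."""
    return sum((knn_regress(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


def grid_search(train, valid, grid):
    """Score every candidate value and return the best one with all scores."""
    scores = {k: validation_error(train, valid, k) for k in grid}
    best = min(scores, key=scores.get)
    return best, scores


train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
valid = [(0.4, 0.4), (2.6, 2.6), (4.4, 4.4)]

best_k, scores = grid_search(train, valid, grid=[1, 2, 6])
print(best_k, scores)
```

Nothing in grid_search depends on the model being a nearest-neighbor regressor; swapping in another model and another grid leaves the loop unchanged, which is why the tuning step looks the same for every model.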