One day, a data scientist told me that Ridge Regression was a complicated model, because its training formula looks more complicated.
Well, this is exactly the objective of my Machine Learning “Advent Calendar”, to clarify this kind of complexity.
So, today, we will talk about penalized versions of linear regression.
- First, we will see why regularization (or penalization) is necessary, and how the model is modified
- Then we will explore different types of regularization and their effects.
- We will also train the model with regularization and test different hyperparameters.
- We will also ask a further question about how to weight the weights in the penalization term. (Confused? You will see.)
Linear regression and its “conditions”
When we talk about linear regression, people often mention that some conditions should be satisfied.
You may have heard statements like:
- the residuals should be Gaussian (this is sometimes confused with the target itself being Gaussian, which is not required)
- the explanatory variables should not be collinear
In classical statistics, these conditions are required for inference. In machine learning, the focus is on prediction, so these assumptions are less central, but the underlying issues still exist.
Here, we will look at an example of two collinear features, and to make the point obvious, let’s make them completely equal.
And we have the relationship: y = x1 + x2, and x1 = x2
I know that if they are completely equal, we could just write y = 2*x1. But the idea is that two features can be very similar without being identical, and we can still build a model using both of them, right?
Then what is the problem?
When features are perfectly collinear, the solution is not unique. Here is an example in the screenshot below.
y = 10000*x1 - 9998*x2

And we can notice that the norm of the coefficients is huge.
So, the idea is to limit the norm of the coefficients.
And after applying the regularization, the conceptual model is the same!
That is right: the parameters of the linear regression change, but the form of the model stays the same.
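To see this concretely, here is a minimal sketch in Python (using NumPy and scikit-learn, which the Excel walkthrough itself does not need): several very different coefficient pairs fit the data equally well, and Ridge simply prefers the small-norm one.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two perfectly collinear features: x2 is an exact copy of x1
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
X = np.column_stack([x1, x1])   # x1 == x2
y = x1 + x1                     # y = x1 + x2 = 2 * x1

# Many coefficient pairs give exactly the same predictions
for a1, a2 in [(1, 1), (2, 0), (10000, -9998)]:
    pred = a1 * X[:, 0] + a2 * X[:, 1]
    print(a1, a2, np.allclose(pred, y))   # True for all three pairs

# Ridge breaks the tie by preferring the smallest-norm solution
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # both coefficients close to 1, not 10000 and -9998
```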
Different Versions of Regularization
So the idea is to combine the MSE and the norm of the coefficients.
Instead of just minimizing the MSE, we try to minimize the sum of the two terms.
Which norm? We can use the L1 norm, the L2 norm, or even combine them.
There are three classical ways to do this, each with its own model name.
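Written out explicitly (with my own notation, not the article's spreadsheet formulas: λ is the regularization strength, ρ the L1/L2 mix, and the intercept is usually not penalized), the three objectives are:

```latex
\begin{aligned}
\text{Ridge:}       \quad & \min_{a,b} \; \mathrm{MSE}(a,b) + \lambda \sum_j a_j^2 \\
\text{Lasso:}       \quad & \min_{a,b} \; \mathrm{MSE}(a,b) + \lambda \sum_j |a_j| \\
\text{Elastic Net:} \quad & \min_{a,b} \; \mathrm{MSE}(a,b) + \lambda \Big( \rho \sum_j |a_j| + (1-\rho) \sum_j a_j^2 \Big)
\end{aligned}
```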
Ridge regression (L2 penalty)
Ridge regression adds a penalty on the squared values of the coefficients.
Intuitively:
- large coefficients are heavily penalized (because of the square)
- coefficients are pushed toward zero
- but they never become exactly zero
Effect:
- all features remain in the model
- coefficients are smoother and more stable
- very effective against collinearity
Ridge shrinks, but does not select.
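As a quick illustration (a scikit-learn sketch, not the Excel implementation used later in this article), the coefficients shrink as the regularization strength alpha grows, but none of them ever reaches exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.5, size=100)  # third feature is irrelevant

for alpha in [0.01, 1, 100, 10000]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 4))
# The coefficients get smaller as alpha grows, but none is exactly zero.
```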

Lasso regression (L1 penalty)
Lasso uses a different penalty: the absolute value of the coefficients.
This small change has a big consequence.
With Lasso:
- some coefficients can become exactly zero
- the model automatically ignores some features
This is where the name LASSO comes from: it stands for Least Absolute Shrinkage and Selection Operator.
- Least: it is derived from a least-squares regression framework
- Absolute: it uses the absolute value of the coefficients (L1 norm)
- Shrinkage: it shrinks coefficients toward zero
- Selection: it can set some coefficients exactly to zero, performing feature selection
- Operator: it refers to the regularization operator added to the loss function
Important nuance:
- we can say that the model still has the same number of coefficients
- but some of them are forced to zero during training
The model form is unchanged, but Lasso effectively removes features by driving coefficients to zero.
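Here is a small scikit-learn sketch (again an illustration, not the article's Excel implementation) where Lasso drives the coefficients of irrelevant features to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only the first two features matter

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)   # the three irrelevant coefficients typically come out as exactly 0.0
```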

Elastic Net (L1 + L2)
Elastic Net is a combination of Ridge and Lasso.
It uses:
- an L1 penalty (like Lasso)
- and an L2 penalty (like Ridge)
Why combine them?
Because:
- Lasso can be unstable when features are highly correlated
- Ridge handles collinearity well but does not select features
Elastic Net gives a balance between:
- stability
- shrinkage
- sparsity
It is often the most practical choice in real datasets.
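A minimal scikit-learn sketch of this behavior (illustrative only; alpha sets the overall strength and l1_ratio balances L1 vs L2):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
X = np.column_stack([x1,                                # feature 1
                     x1 + 0.01 * rng.normal(size=100),  # feature 2, almost identical to feature 1
                     rng.normal(size=100)])             # feature 3, irrelevant
y = 3 * x1 + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
# The two nearly identical features tend to share the weight (Ridge-like behavior),
# while the irrelevant third feature can still be pushed to exactly zero (Lasso-like behavior).
```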
What really changes: model, training, tuning
Let us look at this from a Machine Learning point of view.
The model does not really change
For all the regularized versions, we still write the same model:
y = a*x + b.
- Same number of coefficients
- Same prediction formula
- But the coefficient values will be different.
From a certain perspective, Ridge, Lasso, and Elastic Net are not different models.
The training principle is also the same
We still:
- define a loss function
- minimize it
- compute gradients
- update coefficients
The only difference is:
- the loss function now includes a penalty term
That is it.
The hyperparameters are added (this is the real difference)
For standard linear regression, we have no control over the “complexity” of the model.
- Standard linear regression: no hyperparameter
- Ridge: one hyperparameter (lambda)
- Lasso: one hyperparameter (lambda)
- Elastic Net: two hyperparameters
- one for overall regularization strength
- one to balance L1 vs L2
So:
- standard linear regression does not need tuning
- penalized regressions do
This is why standard linear regression is often seen as “not really Machine Learning”, while regularized versions clearly are.
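In practice, this tuning is typically done by cross-validation. A minimal sketch with scikit-learn's cross-validated variants (my own toy data, not the article's dataset):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Each CV variant searches over a grid of regularization strengths
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
lasso = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

print(ridge.alpha_, lasso.alpha_, enet.alpha_, enet.l1_ratio_)  # the selected hyperparameters
```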
Implementation of Regularized gradients
We keep the gradient descent of OLS regression as a reference, and for Ridge regression, we only have to add the regularization term to the gradient of the coefficient.
We will use a simple dataset that I generated (the same one we already used for Linear Regression).
We can see that the 3 “models” differ in terms of coefficients. The goal in this chapter is to implement the penalized gradient for each model and compare them.
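Here is a minimal Python version of that reference gradient descent for y = a*x + b (the dataset below is a stand-in I generate for illustration; it is not the article's Excel dataset):

```python
import numpy as np

def ols_gradient_descent(x, y, lr=0.01, n_iter=5000):
    """Plain gradient descent for y = a*x + b with the MSE loss."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        error = (a * x + b) - y
        grad_a = (2 / n) * np.sum(error * x)   # d(MSE)/da
        grad_b = (2 / n) * np.sum(error)       # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# A small synthetic dataset standing in for the one used in the article
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)
print(ols_gradient_descent(x, y))   # roughly (2.0, 1.0)
```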

Ridge with penalized gradient
First, we can do it for Ridge, where we only have to change the gradient of a.
This does not mean that the value of b stays the same: at each step, the gradient of b also depends on a.
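A sketch of the Ridge version, reusing the dataset and imports from the OLS sketch above; the only change is the extra 2*lam*a term in the gradient of a:

```python
def ridge_gradient_descent(x, y, lam=1.0, lr=0.01, n_iter=5000):
    """Gradient descent for MSE + lam * a**2 (the intercept b is not penalized)."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        error = (a * x + b) - y
        grad_a = (2 / n) * np.sum(error * x) + 2 * lam * a   # extra term from the penalty lam * a**2
        grad_b = (2 / n) * np.sum(error)                     # same formula, but b still ends up different
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

print(ridge_gradient_descent(x, y, lam=10.0))   # smaller a than the OLS fit
```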

LASSO with penalized gradient
Then we can do the same for LASSO.
And again, the only difference is the gradient of a.
For each model, we can also calculate the MSE and the regularized MSE. It is quite satisfying to see how they decrease over the iterations.
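A sketch for LASSO, where the penalty contributes lam * sign(a) (a subgradient, since |a| is not differentiable at 0); it also tracks the MSE and the regularized MSE at every iteration:

```python
def lasso_gradient_descent(x, y, lam=1.0, lr=0.01, n_iter=5000):
    """(Sub)gradient descent for MSE + lam * |a| (the intercept b is not penalized)."""
    a, b = 0.0, 0.0
    n = len(x)
    mse_history, penalized_history = [], []
    for _ in range(n_iter):
        error = (a * x + b) - y
        grad_a = (2 / n) * np.sum(error * x) + lam * np.sign(a)   # subgradient of lam * |a|
        grad_b = (2 / n) * np.sum(error)
        a -= lr * grad_a
        b -= lr * grad_b
        mse = np.mean(((a * x + b) - y) ** 2)
        mse_history.append(mse)
        penalized_history.append(mse + lam * abs(a))
    return a, b, mse_history, penalized_history

a_l, b_l, mse_hist, pen_hist = lasso_gradient_descent(x, y, lam=10.0)
print(a_l, b_l)
print(pen_hist[0], pen_hist[-1])   # the regularized MSE decreases over the iterations
```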

Comparison of the coefficients
Now, we can visualize the coefficient a for all three models. In order to see the differences, we use deliberately large lambdas.
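With the helper functions sketched above (hypothetical stand-ins for the Excel sheets), the comparison could look like this:

```python
# Compare the fitted slope a for the three "models", with deliberately large lambdas
a_ols, _ = ols_gradient_descent(x, y)
a_ridge, _ = ridge_gradient_descent(x, y, lam=50.0)
a_lasso, _, _, _ = lasso_gradient_descent(x, y, lam=50.0)
print("OLS:  ", round(a_ols, 3))
print("Ridge:", round(a_ridge, 3))
print("Lasso:", round(a_lasso, 3))
```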

Impact of lambda
For large values of lambda, we see that the coefficient a becomes small.
And if the LASSO lambda becomes extremely large, we theoretically get a value of exactly 0 for a. Numerically, though, we have to improve the gradient descent: plain (sub)gradient steps keep bouncing around 0 instead of landing on it.
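A small sweep over lambda (still using the sketches above) makes the shrinking visible:

```python
# Sweep lambda to see how the fitted slope a shrinks
for lam in [0.0, 1.0, 10.0, 50.0]:
    a_r, _ = ridge_gradient_descent(x, y, lam=lam)
    a_l, _, _, _ = lasso_gradient_descent(x, y, lam=lam)
    print(lam, round(a_r, 4), round(a_l, 4))

# For much larger lambdas, this plain implementation needs care: Ridge requires a smaller
# learning rate to stay stable, and the Lasso subgradient step oscillates around 0
# instead of landing exactly on it (coordinate descent or proximal steps fix this).
```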

Regularized Logistic Regression?
We saw Logistic Regression yesterday, and one question we can ask is whether it can also be regularized. If yes, what are the regularized versions called?
The answer is of course yes: Logistic Regression can be regularized.
Exactly the same idea applies.
Logistic regression can also be:
- L1 penalized
- L2 penalized
- Elastic Net penalized
There are no special names like “Ridge Logistic Regression” in common usage.
Why?
Because the concept is no longer new.
In practice, libraries like scikit-learn simply let you specify:
- the loss function
- the penalty type
- the regularization strength
The naming mattered when the idea was new.
Now, regularization is just a standard option.
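For example, with scikit-learn (toy data for illustration; note that C is the inverse of the regularization strength, so a smaller C means stronger regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Same estimator, different penalty options
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
en_model = LogisticRegression(penalty="elasticnet", C=0.1, l1_ratio=0.5,
                              solver="saga", max_iter=5000).fit(X, y)

print(l2_model.coef_)   # shrunk, but all nonzero
print(l1_model.coef_)   # some coefficients can be exactly zero
print(en_model.coef_)
```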
Other questions we can ask:
- Is regularization always useful?
- How does the scaling of features impact the performance of regularized linear regression?
Conclusion
Ridge and Lasso do not change the linear model itself; they change how the coefficients are learned. By adding a penalty, regularization favors stable and meaningful solutions, especially when features are correlated. Seeing this process step by step in Excel makes it clear that these methods are not more complex, just more controlled.