
Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models
By Miguel Cardona Polo, January 2025


A full explanation of linear regression and how it learns

The Crane Stance. Public Domain image from Openverse

Just like Mr. Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression lays the groundwork for understanding the most complex of AI architectures such as Deep Neural Networks and LLMs.

Through this deep dive into the simple yet powerful linear regression, you will learn many of the fundamental parts that make up the most advanced models built today by billion-dollar companies.

Linear regression is a simple mathematical method used to understand the relationship between two variables and make predictions. Given some data points, such as those shown below, linear regression attempts to draw the line of best fit through them. It’s the “wax on, wax off” of data science.

Example of a linear regression model tracing the line of best fit through data points on a graph. Image captured by Author

Once this line is drawn, we have a model that we can use to predict new values. In the above example, given a new house size, we could attempt to predict its price with the linear regression model.

The Linear Regression Formula

Labelled Linear Regression Formula. Image captured by Author
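
In case the image does not display, the general form of the formula (written here in standard notation, which may differ slightly from the author’s labelled version) is:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n
```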

Y is the dependent variable, that which you want to calculate — the house price in the previous example. Its value depends on other variables, hence its name.

X are the independent variables. These are the factors that influence the value of Y. When modelling, the independent variables are the input to the model, and what the model spits out is the prediction or Ŷ.

β are parameters. We give the name parameter to those values that the model adjusts (or learns) to capture the relationship between the independent variables X and the dependent variable Y. So, as the model is trained, the input of the model will remain the same, but the parameters will be adjusted to better predict the desired output.

Parameter Learning

We require a few things to be able to adjust the parameters and achieve accurate predictions.

  1. Training Data — this data consists of input and output pairs. The inputs will be fed into the model and during training, the parameters will be adjusted in an attempt to output the target value.
  2. Cost function — also known as the loss function, this is a mathematical function that measures how well a model’s prediction matches the target value.
  3. Training Algorithm — a method used to adjust the parameters of the model to minimise the error as measured by the cost function.

Let’s go over a cost function and training algorithm that can be used in linear regression.

Mean Squared Error (MSE) is a commonly used cost function in regression problems, where the goal is to predict a continuous value. This is different from classification tasks, such as predicting the next token in a vocabulary, as in Large Language Models. MSE focuses on numerical differences and is used in a wide range of regression and neural network problems. Here is how to calculate it:

Mean Squared Error (MSE) formula. Image captured by Author
  1. Calculate the difference between the predicted value, Ŷ, and the target value, Y.
  2. Square this difference — ensuring all errors are positive and also penalising large errors more heavily.
  3. Sum the squared differences for all data samples.
  4. Divide the sum by the number of samples, n, to get the average squared error.
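
As a quick illustration of those four steps, here is a minimal Python sketch (the function name and the example values are made up for demonstration):

```python
def mean_squared_error(y_true, y_pred):
    # Steps 1-2: squared difference between each target and prediction,
    # step 3: summed over all samples, step 4: divided by the number of samples n.
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Illustrative values only: predictions close to the targets give a small MSE.
print(mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))  # 0.1666...
```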

You will notice that as our prediction gets closer to the target value, the MSE gets lower, and the further away it is, the larger the MSE grows. In both directions the change is quadratic, because the difference is squared.

The concept of gradient descent is that we can travel through the “cost space” in small steps, with the objective of arriving at the global minimum — the lowest value in the space. The cost function evaluates how well the current model parameters predict the target by giving us the loss value. Randomly modifying the parameters does not guarantee any improvement. But if we examine the gradient of the loss function with respect to each parameter, i.e. how the loss changes when that parameter is slightly adjusted, we can update the parameters to move towards a lower loss, indicating that our predictions are getting closer to the target values.

Labelled graph showing the key concepts of the gradient descent algorithm: the local and global minimum, the learning rate, and how each step advances the position towards a lower cost. Image captured by Author

The steps in gradient descent must be carefully sized to balance progress and precision. If the steps are too large, we risk overshooting the global minimum and missing it entirely. On the other hand, if the steps are too small, the updates will become inefficient and time-consuming, increasing the likelihood of getting stuck in a local minimum instead of reaching the desired global minimum.

Gradient Descent Formula

Labelled Gradient Descent formula. Image captured by Author

In the context of linear regression, θ could be β0 or β1. The gradient is the partial derivative of the cost function with respect to θ, or in simpler terms, it is a measure of how much the cost function changes when the parameter θ is slightly adjusted.

A large gradient indicates that the parameter has a significant effect on the cost function, while a small gradient suggests a minor effect. The sign of the gradient indicates the direction of change for the cost function. A negative gradient means the cost function will decrease as the parameter increases, while a positive gradient means it will increase.

So, in the case of a large negative gradient, what happens to the parameter? The negative sign in front of the learning rate cancels with the negative sign of the gradient, resulting in an addition to the parameter. And since the gradient is large, we will be adding a large number to it. So the parameter is adjusted substantially, reflecting its greater influence on reducing the cost function.
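
Expressed as code, a single update is just one line. This is a generic sketch of the rule in the formula above; the numbers are only there to illustrate the large-negative-gradient case just described:

```python
def gradient_descent_step(parameter, gradient, learning_rate):
    # Subtracting the gradient: a negative gradient increases the parameter,
    # a positive gradient decreases it, scaled by the learning rate.
    return parameter - learning_rate * gradient

print(gradient_descent_step(0.0, -50.0, 0.01))  # large negative gradient -> parameter grows to 0.5
```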

Let’s take a look at the prices of the sponges the Karate Kid used to wash Mr. Miyagi’s car. If we wanted to predict their price (dependent variable) based on their width and height (independent variables), we could model it using linear regression.

We can start with these three training data samples.

Training data for the linear regression example modelling prices of sponges. Image captured by Author

Now, let’s use the Mean Squared Error (MSE) as our cost function J, and linear regression as our model.

Formula for the cost function derived from MSE and linear regression. Image captured by Author

The linear regression formula uses X1 and X2 for width and height respectively. Notice there are no other independent variables, since our training data doesn’t include any more. That is the assumption we make in this example: the width and height of the sponge are enough to predict its price.
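
Putting the two together, the cost function for this two-feature model (a reconstruction based on the MSE defined earlier, which should match the formula in the image above) is:

```latex
J(\beta_0, \beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - (\beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i}) \Big)^2
```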

Now, the first step is to initialise the parameters, in this case to 0. We can then feed the independent variables into the model to get our predictions, Ŷ, and check how far these are from our target Y.

Step 0 in gradient descent algorithm and the calculation of the mean squared error. Image captured by Author
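
A quick sanity check of that first pass: with all parameters initialised to zero, every prediction is zero, so the initial MSE is simply the average of the squared target prices.

```latex
\hat{Y}_i = 0 + 0 \cdot X_{1,i} + 0 \cdot X_{2,i} = 0
\quad \Rightarrow \quad
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - 0)^2 = \frac{1}{n} \sum_{i=1}^{n} Y_i^2
```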

Right now, as you can imagine, the parameters are not very helpful. But we are now prepared to use the gradient descent algorithm to update the parameters into more useful ones. First, we need to calculate the partial derivatives of each parameter, which will require some calculus, but luckily we only need to do this once in the whole process.

Working out of the partial derivatives of the linear regression parameters. Image captured by Author
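
For reference, these are the standard partial derivatives of the MSE cost function defined above, which the working in the image should arrive at (with Ŷ being the model’s prediction):

```latex
\frac{\partial J}{\partial \beta_0} = -\frac{2}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i), \qquad
\frac{\partial J}{\partial \beta_1} = -\frac{2}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)\, X_{1,i}, \qquad
\frac{\partial J}{\partial \beta_2} = -\frac{2}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)\, X_{2,i}
```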

With the partial derivatives, we can substitute in the values from our errors to calculate the gradient of each parameter.

Calculation of parameter gradients. Image captured by Author

Notice there wasn’t any need to calculate the MSE itself, as it’s not directly used in the process of updating the parameters; only its derivative is. It’s also immediately apparent that all gradients are negative, meaning that every parameter can be increased to reduce the cost function. The next step is to update the parameters with a learning rate, which is a hyper-parameter: a configuration setting in a machine learning model that is specified before the training process begins. Unlike model parameters, which are learned during training, hyper-parameters are set manually and control aspects of the learning process. Here we arbitrarily use 0.01.

Parameter updating in the first iteration of gradient descent. Image captured by Author
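
In symbols, each parameter is nudged against its own gradient using the update rule from earlier, with the learning rate of 0.01:

```latex
\beta_j \leftarrow \beta_j - 0.01 \cdot \frac{\partial J}{\partial \beta_j}, \qquad j \in \{0, 1, 2\}
```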

This has been the final step of our first iteration in the process of gradient descent. We can use these new parameter values to make new predictions and recalculate the MSE of our model.

Last step in the first iteration of gradient descent, and recalculation of MSE after parameter updates. Image captured by Author

The new parameters are getting closer to the true sponge prices, and have yielded a much lower MSE, but there is a lot more training left to do. If we iterate through the gradient descent algorithm 50 times, this time using Python instead of doing it by hand — since Mr. Miyagi never said anything about coding — we will reach the following values.

Results of some iterations of the gradient descent algorithm, and a graph showing the MSE over the gradient descent steps. Image captured by Author
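
Here is a minimal Python sketch of that training loop. The sponge measurements below are illustrative stand-ins (the actual table only exists as an image above); the prices are generated from the true parameters [1, 2, 3] mentioned next, so the loop behaves in the same spirit as the author’s run, even if the exact numbers differ.

```python
# Gradient descent for linear regression with two features: width (X1) and height (X2).
# Illustrative data only: price = 1 + 2*width + 3*height, i.e. true parameters [1, 2, 3].
X = [(2.0, 1.0), (3.0, 3.0), (5.0, 2.0)]   # (width, height) pairs, made up for this sketch
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]  # target prices: [8.0, 16.0, 17.0]

b0, b1, b2 = 0.0, 0.0, 0.0                 # initialise parameters to zero
learning_rate = 0.01
n = len(X)

for step in range(50):
    # Predictions and errors with the current parameters.
    preds = [b0 + b1 * x1 + b2 * x2 for x1, x2 in X]
    errors = [y - y_hat for y, y_hat in zip(Y, preds)]

    # Partial derivatives of the MSE with respect to each parameter.
    grad_b0 = -2 / n * sum(errors)
    grad_b1 = -2 / n * sum(e * x1 for e, (x1, _) in zip(errors, X))
    grad_b2 = -2 / n * sum(e * x2 for e, (_, x2) in zip(errors, X))

    # Gradient descent update.
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1
    b2 -= learning_rate * grad_b2

# Recompute predictions with the final parameters and report the MSE.
preds = [b0 + b1 * x1 + b2 * x2 for x1, x2 in X]
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(Y, preds)) / n
print(f"b0={b0:.2f}, b1={b1:.2f}, b2={b2:.2f}, MSE={mse:.4f}")
```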

Eventually we arrive at a pretty good model. The true values I used to generate those numbers were [1, 2, 3], and after only 50 iterations the model’s parameters came impressively close. Extending the training to 200 steps (the number of iterations is another hyper-parameter) with the same learning rate allowed the linear regression model to converge almost perfectly to the true parameters, demonstrating the power of gradient descent.

Many of the fundamental concepts that make up the complicated martial art of artificial intelligence, like cost functions and gradient descent, can be thoroughly understood just by studying the simple “wax on, wax off” tool that linear regression is.

Artificial intelligence is a vast and complex field, built upon many ideas and methods. While there’s much more to explore, mastering these fundamentals is a significant first step. Hopefully, this article has brought you closer to that goal, one “wax on, wax off” at a time.
