...

A Visual Guide to Tuning Gradient Boosted Trees


Introduction

My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore the gradient boosted tree (GBT)!

There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going to use sklearn’s GradientBoostingRegressor. Why? Simply because, compared with the others, it made visualising easier. In practice I tend to use the other libraries more than the sklearn one; however, this project is about visual learning, not pure performance.

Fundamentally, a GBT is a combination of trees that only work together. While a single decision tree (including one extracted from a random forest) can make a decent prediction by itself, taking an individual tree from a GBT is unlikely to give anything usable.

Beyond this, as always, no theory, no maths — just plots and hyperparameters. As before, I’ll be using the California housing dataset via scikit-learn (CC-BY) and the same general process as described in my previous posts; the code is at https://github.com/jamesdeluk/data-projects/tree/main/visualising-trees; and all images below are created by me (apart from the GIF, which is from Tenor).

A basic gradient boosted tree

Starting with a basic GBT: gb = GradientBoostingRegressor(random_state=42). Similar to other tree types, the default settings for min_samples_split, min_samples_leaf, max_leaf_nodes are 2, 1, None respectively. Interestingly, the default max_depth is 3, not None as it is with decision trees/random forests. Notable hyperparameters, which I’ll look into more later, include learning_rate (how strongly each new tree’s correction is applied, default 0.1), and n_estimators (similar to random forest — the number of trees).
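If you want to follow along, here’s a minimal sketch of the setup (the train/test split settings and variable names are my assumptions, not necessarily what’s in the repo):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Load the California housing data and make a train/test split
# (split settings here are an assumption, not necessarily the repo's)
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# The basic GBT: learning_rate=0.1, n_estimators=100, max_depth=3 by default
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
preds = gb.predict(X_test)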

Fitting took 2.2s, predicting took 0.005s, and the results:

Metric Default
MAE 0.369
MAPE 0.216
MSE 0.289
RMSE 0.538
R² 0.779

So, quicker than the default random forest, but slightly worse performance. For my chosen block, it predicted 0.803 (actual 0.894).

Visualising

This is why you’re here, right?

The tree

Similar to before, we can plot a single tree. This is the first one, accessed with gb.estimators_[0, 0]:

I’ve explained these in the previous posts, so I won’t do so again here. One thing I will bring to your attention though: notice how terrible the values are! Three of the leaves even have negative values, which we know cannot be the case. This is why a GBT only works as a combined ensemble, not as separate standalone trees like in a random forest.
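For reference, this is roughly how that plot can be produced (figure size is arbitrary):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# The first tree of the ensemble; gb.estimators_ has shape (n_estimators, 1)
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(gb.estimators_[0, 0], feature_names=list(X.columns), filled=True, ax=ax)
plt.show()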

Predictions and errors

My favourite way to visualise GBTs is with prediction vs iteration plots, using gb.staged_predict. For my chosen block:

Remember the default model has 100 estimators? Well, here they are. The initial prediction was way off — 2! But with each iteration it learnt (remember learning_rate?) and got closer to the real value. Of course, it was trained on the training data, not this specific data, so the final value was still off (0.803, so about 10% off), but you can clearly see the process.

In this case, it reached a fairly steady state after about 50 iterations. Later we’ll see how to stop iterating at this stage, to avoid wasting time and money.
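A sketch of how such a plot can be generated with staged_predict (the row I index here is a placeholder; your chosen block will differ):

import matplotlib.pyplot as plt

# Prediction for a single block at every boosting iteration
# (the block indexed here is a placeholder - pick your own chosen block)
chosen_block = X_test.iloc[[0]]
staged_preds = [p[0] for p in gb.staged_predict(chosen_block)]

plt.plot(range(1, len(staged_preds) + 1), staged_preds, label='Prediction')
plt.axhline(y=0.894, linestyle='--', label='True value')  # true value of my chosen block
plt.xlabel('Iteration')
plt.ylabel('Predicted value')
plt.legend()
plt.show()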

Similarly, the error (i.e. the prediction minus the true value) can be plotted. Of course, this gives us the same plot, simply with different y-axis values:

Let’s take this one step further! The test data has over 5000 blocks to predict; we can loop through each, and predict them all, for each iteration!

I love this plot.

They all start around 2, but explode across the iterations. We know all the true values vary from 0.15 to 5, with a mean of 2.1 (check my first post), so this spreading out of predictions (from ~0.3 to ~5.5) is as expected.
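One way to build this plot (not necessarily how the repo does it): staged_predict already returns predictions for every block at each iteration, so the “loop” can be a single array stack.

import numpy as np
import matplotlib.pyplot as plt

# Stack the per-iteration predictions into an array of shape (n_iterations, n_blocks)
all_staged = np.stack(list(gb.staged_predict(X_test)))

plt.plot(all_staged, color='tab:blue', alpha=0.01)  # one (very faint) line per block
plt.xlabel('Iteration')
plt.ylabel('Predicted value')
plt.show()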

We can also plot the errors:

At first glance, it seems a bit strange — we’d expect them to start at, say, ±2, and converge on 0. Looking carefully though, this does happen for most — it can be seen in the left-hand side of the plot, the first 10 iterations or so. The problem is, with over 5000 lines on this plot, there are a lot of overlapping ones, making the outliers stand out more. Perhaps there’s a better way to visualise these? How about…

The median error is 0.05 — which is very good! The IQR is less than 0.5, which is also decent. So, while there are some terrible predictions, most are decent.
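These summary numbers can be pulled straight from the final-iteration errors; a quick sketch, reusing the all_staged array from the previous sketch:

import numpy as np

# Signed errors (prediction minus true value) at the final iteration
final_errors = all_staged[-1] - y_test.to_numpy()

median_error = np.median(final_errors)
q1, q3 = np.percentile(final_errors, [25, 75])
print(f'Median error: {median_error:.2f}, IQR: {q3 - q1:.2f}')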

Hyperparameter tuning

Decision tree hyperparameters

Same as before, let’s compare how the hyperparameters explored in the original decision tree post apply to GBTs, with the default hyperparameters of learning_rate = 0.1 and n_estimators = 100. The min_samples_leaf, min_samples_split, and max_leaf_nodes models also have max_depth = 10, to make them a fair comparison with the previous posts and with each other.
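For reference, the comparison models might be set up something like this (a sketch; random_state and the scoring loop are my assumptions):

from sklearn.ensemble import GradientBoostingRegressor

# One model per constraint; the last three also cap max_depth at 10
models = {
    'max_depth=None': GradientBoostingRegressor(max_depth=None, random_state=42),
    'max_depth=10': GradientBoostingRegressor(max_depth=10, random_state=42),
    'min_samples_leaf=10': GradientBoostingRegressor(max_depth=10, min_samples_leaf=10, random_state=42),
    'min_samples_split=10': GradientBoostingRegressor(max_depth=10, min_samples_split=10, random_state=42),
    'max_leaf_nodes=100': GradientBoostingRegressor(max_depth=10, max_leaf_nodes=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # R² for a quick comparison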

Model max_depth=None max_depth=10 min_samples_leaf=10 min_samples_split=10 max_leaf_nodes=100
Fit Time (s) 10.889 7.009 7.101 7.015 6.167
Predict Time (s) 0.089 0.019 0.015 0.018 0.013
MAE 0.454 0.304 0.301 0.302 0.301
MAPE 0.253 0.177 0.174 0.174 0.175
MSE 0.496 0.222 0.212 0.217 0.210
RMSE 0.704 0.471 0.46 0.466 0.458
R² 0.621 0.830 0.838 0.834 0.840
Chosen Prediction 0.885 0.906 0.962 0.918 0.923
Chosen Error 0.009 0.012 0.068 0.024 0.029

Unlike decision trees and random forests, the deeper tree performed far worse! And took longer to fit. However, increasing the depth from 3 (the default) to 10 has improved the scores. The other constraints resulted in further improvements — again showing how all hyperparameters can play a role.

learning_rate

GBTs operate by tweaking predictions after each iteration based on the error. The learning rate scales how much of each correction is applied: the higher it is, the more the prediction changes between iterations.
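In code, the core idea looks roughly like this. This is a conceptual sketch of boosting with a squared-error loss, not sklearn’s actual internals, and it reuses X_train/y_train from earlier:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Conceptual sketch of gradient boosting with squared-error loss - not sklearn's internal code
learning_rate = 0.1
prediction = np.full(len(y_train), y_train.mean())  # start with a constant prediction

for _ in range(100):  # n_estimators iterations
    residuals = y_train - prediction  # how wrong are we right now?
    tree = DecisionTreeRegressor(max_depth=3).fit(X_train, residuals)
    prediction = prediction + learning_rate * tree.predict(X_train)  # scaled correction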

There is a clear trade-off for learning rate. Comparing learning rates of 0.01 (Slow), 0.1 (Default), and 0.5 (Fast), over 100 iterations:

Faster learning rates can get to the correct value quicker, but they’re more likely to overcorrect and jump past the true value (think fishtailing in a car), and can lead to oscillations. Slow learning rates may never reach the correct value (think… not turning the steering wheel enough and driving straight into a tree). As for the stats:

Model Default Fast Slow
Fit Time (s) 2.159 2.288 2.166
Predict Time (s) 0.005 0.004 0.015
MAE 0.370 0.338 0.629
MAPE 0.216 0.197 0.427
MSE 0.289 0.247 0.661
RMSE 0.538 0.497 0.813
R² 0.779 0.811 0.495
Chosen Prediction 0.803 0.949 1.44
Chosen Error 0.091 0.055 0.546

Unsurprisingly, the slow learning model was terrible. For this block, fast was slightly better than the default overall. However, we can see on the plot how, at least for the chosen block, it was the last 90 iterations that got the fast model to be more accurate than the default one — if we’d stopped at 40 iterations, for the chosen block at least, the default model would have been far better. The joys of visualisation!

n_estimators

As mentioned above, the number of estimators goes hand in hand with the learning rate. In general, the more estimators the better, as it gives more iterations to measure and adjust for the error — although this comes at an additional time cost.

As seen above, a sufficiently high number of estimators is especially important for a low learning rate, to ensure the correct value is reached. Increasing the number of estimators to 500:

With enough iterations, the slow learning GBT did reach the true value. In fact, they all ended up much closer. The stats confirm this:

Model DefaultMore FastMore SlowMore
Fit Time (s) 12.254 12.489 11.918
Predict Time (s) 0.018 0.014 0.022
MAE 0.323 0.319 0.410
MAPE 0.187 0.185 0.248
MSE 0.232 0.228 0.338
RMSE 0.482 0.477 0.581
R² 0.823 0.826 0.742
Chosen Prediction 0.841 0.921 0.858
Chosen Error 0.053 0.027 0.036

Unsurprisingly, increasing the number of estimators five-fold increased the time to fit significantly (in this case by six-fold, but that may just be a one-off). However, we still haven’t surpassed the scores of the constrained trees above — I guess we’ll need to do a hyperparameter search to see if we can beat them. Also, for the chosen block, as can be seen in the plot, after about 300 iterations none of the models really improved. If this is consistent across all the data, then everything beyond those ~300 iterations was wasted time. I mentioned earlier that it’s possible to avoid wasting time on iterations that don’t improve anything; now’s the time to look into that.

n_iter_no_change, validation_fraction, and tol

It’s possible for additional iterations to add nothing to the final result, yet they still take time to run. This is where early stopping comes in.

There are three relevant hyperparameters. The first, n_iter_no_change, is how many consecutive iterations must show “no change” before training stops. tol[erance] is how small the improvement in the validation score has to be for an iteration to count as “no change”. And validation_fraction is how much of the training data is set aside as a validation set to generate that validation score (note this is separate from the test data).
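A sketch of how this looks in code, using the values I compare below (random_state is my assumption):

from sklearn.ensemble import GradientBoostingRegressor

# Up to 1000 estimators, but stop once 5 consecutive iterations fail to improve
# the validation score by at least tol
gb_early = GradientBoostingRegressor(
    n_estimators=1000,
    n_iter_no_change=5,
    validation_fraction=0.1,
    tol=0.005,
    random_state=42,
)
gb_early.fit(X_train, y_train)
print(gb_early.n_estimators_)  # the number of iterations actually run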

Comparing a 1000-estimator GBT with one using fairly aggressive early stopping — n_iter_no_change=5, validation_fraction=0.1, tol=0.005 — the latter stopped after only 61 estimators (and hence took only 5-6% of the time to fit):

As expected though, the results were worse:

Model Default Early Stopping
Fit Time (s) 24.843 1.304
Predict Time (s) 0.042 0.003
MAE 0.313 0.396
MAPE 0.181 0.236
MSE 0.222 0.321
RMSE 0.471 0.566
R² 0.830 0.755
Chosen Prediction 0.837 0.805
Chosen Error 0.057 0.089

But as always, the question to ask: is it worth investing 20x the time to improve the R² by 10%, or to reduce the error by 20%?

Bayes searching

You were probably expecting this. The search spaces:

search_spaces = {
    'learning_rate': (0.01, 0.5),
    'max_depth': (1, 100),
    'max_features': (0.1, 1.0, 'uniform'),
    'max_leaf_nodes': (2, 20000),
    'min_samples_leaf': (1, 100),
    'min_samples_split': (2, 100),
    'n_estimators': (50, 1000),
}

Most are similar to my previous posts; the only additional hyperparameter is learning_rate.
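For reference, a sketch of how the search might be run with skopt’s BayesSearchCV (the n_iter, cv, scoring, and n_jobs values here are my assumptions, not necessarily what’s in the repo):

from skopt import BayesSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Bayesian search over the spaces defined above
opt = BayesSearchCV(
    GradientBoostingRegressor(random_state=42),
    search_spaces,
    n_iter=100,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_)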

This search took the longest so far, at 96 mins (~50% more than the random forest!). The best hyperparameters are:

best_parameters = OrderedDict({
    'learning_rate': 0.04345459461297153,
    'max_depth': 13,
    'max_features': 0.4993693929975871,
    'max_leaf_nodes': 20000,
    'min_samples_leaf': 1,
    'min_samples_split': 83,
    'n_estimators': 325,
})

max_features, max_leaf_nodes, and min_samples_leaf are very similar to the tuned random forest. n_estimators is too, and it aligns with what the chosen-block plot above suggested — the extra 700 iterations were mostly unnecessary. However, compared with the tuned random forest, the trees are only a third as deep, and min_samples_split is far higher than we’ve seen so far. The value of learning_rate was not too surprising based on what we saw above.

And the cross-validated scores:

Metric Mean Std
MAE -0.289 0.005
MAPE -0.161 0.004
MSE -0.200 0.008
RMSE -0.448 0.009
R² 0.849 0.006

Of all the models so far, this is the best, with smaller errors, higher R², and lower variances!
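As an aside, the negative MAE/MAPE/MSE/RMSE values are just sklearn’s scoring convention, where error metrics are negated so that higher is always better. A sketch of how a table like this can be generated (the scorer names and cv value are my assumptions):

from sklearn.model_selection import cross_validate

# Cross-validate the best estimator with several metrics at once
scoring = {
    'MAE': 'neg_mean_absolute_error',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MSE': 'neg_mean_squared_error',
    'RMSE': 'neg_root_mean_squared_error',
    'R2': 'r2',
}
cv_results = cross_validate(opt.best_estimator_, X_train, y_train, scoring=scoring, cv=5)
for name in scoring:
    scores = cv_results[f'test_{name}']
    print(f'{name}: mean {scores.mean():.3f}, std {scores.std():.3f}')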

Finally, our old friend, the box plots:

Conclusion

And so we come to the end of my mini-series on the three most common types of tree-based models.

My hope is that, by seeing different ways of visualising trees, you now (a) better understand how the different models function, without having to look at equations, and (b) can use your own plots to tune your own models. It can also help with stakeholder management — execs prefer pretty pictures to tables of numbers, so showing them a tree plot can help them understand why what they’re asking you to do is impossible.

Based on this dataset, and these models, the gradient boosted one was slightly superior to the random forest, and both were far superior to a lone decision tree. However, this may have been because the GBT had 50% more time to search for better hyperparameters (GBTs are typically more computationally expensive — after all, both searches ran the same number of iterations). It’s also worth noting that GBTs have a higher tendency to overfit than random forests. And while the decision tree had worse performance, it is far faster — and in some use cases, this is more important. Additionally, as mentioned, there are other libraries, with pros and cons — for example, CatBoost handles categorical data out of the box, whereas other GBT libraries typically require categorical data to be preprocessed (e.g. one-hot or label encoding). Or, if you’re feeling really brave, how about stacking the different tree types in an ensemble for even better performance…

Anyway, until next time!
