Introduction
My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore gradient boosted trees (GBTs)!
There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this post I’m going to use sklearn’s. Why? Simply because, compared with the others, it was easier to visualise. In practice I tend to use the other libraries more than the sklearn one; however, this project is about visual learning, not pure performance.
Fundamentally, a GBT is a combination of trees that only work together. While a single decision tree (including one extracted from a random forest) can make a decent prediction by itself, taking an individual tree from a GBT is unlikely to give anything usable.
Beyond this, as always, no theory, no maths — just plots and hyperparameters. As before, I’ll be using the California housing dataset via scikit-learn (CC-BY), the same general process as described in my previous posts, the code is at https://github.com/jamesdeluk/data-projects/tree/main/visualising-trees, and all images below are created by me (apart from the GIF, which is from Tenor).
A basic gradient boosted tree
Starting with a basic GBT: gb = GradientBoostingRegressor(random_state=42). Similar to other tree types, the default settings for min_samples_split, min_samples_leaf, and max_leaf_nodes are 2, 1, and None respectively. Interestingly, the default max_depth is 3, not None as it is with decision trees/random forests. Notable hyperparameters, which I’ll look into more later, include learning_rate (how much each tree’s correction is scaled, default 0.1) and n_estimators (similar to random forest — the number of trees).
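For reference, a minimal sketch of the setup, assuming the same train/test split as the previous posts (the variable names are mine):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# California housing data, split as in the previous posts
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A GBT with default hyperparameters: 100 trees of depth 3, learning rate 0.1
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)

preds = gb.predict(X_test)
print(mean_absolute_error(y_test, preds), r2_score(y_test, preds))
```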
Fitting took 2.2s, predicting took 0.005s, and the results:
Metric | Default |
---|---|
MAE | 0.369 |
MAPE | 0.216 |
MSE | 0.289 |
RMSE | 0.538 |
R² | 0.779 |
So, quicker than the default random forest, but slightly worse performance. For my chosen block, it predicted 0.803 (actual 0.894).
Visualising
This is why you’re here, right?
The tree
Similar to before, we can plot a single tree. This is the first one, accessed with gb.estimators_[0, 0]:
I’ve explained these in the previous posts, so I won’t do so again here. One thing I will bring to your attention though: notice how terrible the values are! Three of the leaves even have negative values, which we know cannot be the case. This is why a GBT only works as a combined ensemble, not as separate standalone trees like in a random forest.
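For completeness, a sketch of how such a tree plot can be produced with sklearn’s plot_tree (the figure size is my choice):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# The fitted ensemble is a 2D array of trees; [0, 0] is the first one
first_tree = gb.estimators_[0, 0]

plt.figure(figsize=(20, 10))
plot_tree(first_tree, feature_names=gb.feature_names_in_, filled=True, rounded=True)
plt.show()
```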
Predictions and errors
My favourite way to visualise GBTs is with prediction vs iteration plots, using gb.staged_predict. For my chosen block:
Remember the default model has 100 estimators? Well, here they are. The initial prediction was way off — 2! But with each iteration it learnt (remember learning_rate?), and got closer to the real value. Of course, it was trained on the training data, not this specific block, so the final value was still off (0.803, about 10% below the actual 0.894), but you can clearly see the process.
In this case, it reached a fairly steady state after about 50 iterations. Later we’ll see how to stop iterating at this stage, to avoid wasting time and money.
Similarly, the error (i.e. the prediction minus the true value) can be plotted. Of course, this gives us the same plot, simply with different y-axis values:
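Both plots come from staged_predict; a sketch, assuming chosen_block is a single-row slice of the data and chosen_value is its true value (both names are mine):

```python
import matplotlib.pyplot as plt

# staged_predict yields the ensemble's prediction after each boosting iteration
staged = [p[0] for p in gb.staged_predict(chosen_block)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(staged)
ax1.axhline(chosen_value, linestyle="--")  # the true value
ax1.set(xlabel="Iteration", ylabel="Prediction")

ax2.plot([p - chosen_value for p in staged])  # the same curve, shifted to show the error
ax2.axhline(0, linestyle="--")
ax2.set(xlabel="Iteration", ylabel="Error")

plt.show()
```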
Let’s take this one step further! The test data has over 5000 blocks to predict; we can loop through each, and predict them all, for each iteration!
I love this plot.
They all start around 2, but explode across the iterations. We know all the true values vary from 0.15 to 5, with a mean of 2.1 (check my first post), so this spreading out of predictions (from ~0.3 to ~5.5) is as expected.
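A sketch of how this spaghetti plot can be built: staged_predict over the whole test set gives one prediction per block per iteration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Shape (n_iterations, n_test_blocks): every block's prediction at every iteration
all_staged = np.array(list(gb.staged_predict(X_test)))

# One thin line per block, plotted against iteration number
plt.plot(all_staged, linewidth=0.2, alpha=0.3)
plt.xlabel("Iteration")
plt.ylabel("Prediction")
plt.show()
```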
We can also plot the errors:
At first glance, it seems a bit strange — we’d expect them to start at, say, ±2, and converge on 0. Looking carefully though, this does happen for most — it can be seen in the left-hand side of the plot, the first 10 iterations or so. The problem is, with over 5000 lines on this plot, there are a lot of overlapping ones, making the outliers stand out more. Perhaps there’s a better way to visualise these? How about…
The median error is 0.05 — which is very good! The IQR is less than 0.5, which is also decent. So, while there are some terrible predictions, most are decent.
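For reference, a box plot of the final errors across the test set can be as simple as this sketch:

```python
import matplotlib.pyplot as plt

# Error at the final iteration for every block in the test set
final_errors = gb.predict(X_test) - y_test

plt.boxplot(final_errors, vert=False)
plt.xlabel("Prediction error")
plt.show()
```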
Hyperparameter tuning
Decision tree hyperparameters
Same as before, let’s compare how the hyperparameters explored in the original decision tree post apply to GBTs, with the default hyperparameters of learning_rate = 0.1 and n_estimators = 100. The min_samples_leaf, min_samples_split, and max_leaf_nodes models also have max_depth = 10, to make it a fair comparison to previous posts and to each other.
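A sketch of how these comparison models might be set up (the dictionary and its keys are mine):

```python
from sklearn.ensemble import GradientBoostingRegressor

# One model per column of the table below; everything else left at the defaults
models = {
    "max_depth=None": GradientBoostingRegressor(max_depth=None, random_state=42),
    "max_depth=10": GradientBoostingRegressor(max_depth=10, random_state=42),
    "min_samples_leaf=10": GradientBoostingRegressor(max_depth=10, min_samples_leaf=10, random_state=42),
    "min_samples_split=10": GradientBoostingRegressor(max_depth=10, min_samples_split=10, random_state=42),
    "max_leaf_nodes=100": GradientBoostingRegressor(max_depth=10, max_leaf_nodes=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # R² on the test set
```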
Model | max_depth=None | max_depth=10 | min_samples_leaf=10 | min_samples_split=10 | max_leaf_nodes=100 |
---|---|---|---|---|---|
Fit Time (s) | 10.889 | 7.009 | 7.101 | 7.015 | 6.167 |
Predict Time (s) | 0.089 | 0.019 | 0.015 | 0.018 | 0.013 |
MAE | 0.454 | 0.304 | 0.301 | 0.302 | 0.301 |
MAPE | 0.253 | 0.177 | 0.174 | 0.174 | 0.175 |
MSE | 0.496 | 0.222 | 0.212 | 0.217 | 0.210 |
RMSE | 0.704 | 0.471 | 0.46 | 0.466 | 0.458 |
R² | 0.621 | 0.830 | 0.838 | 0.834 | 0.840 |
Chosen Prediction | 0.885 | 0.906 | 0.962 | 0.918 | 0.923 |
Chosen Error | 0.009 | 0.012 | 0.068 | 0.024 | 0.029 |
Unlike decision trees and random forests, the deeper tree performed far worse! And took longer to fit. However, increasing the depth from 3 (the default) to 10 has improved the scores. The other constraints resulted in further improvements — again showing how all hyperparameters can play a role.
learning_rate
GBTs operate by tweaking predictions after each iteration based on the error. The learning rate scales how big each adjustment is: the higher it is, the more the prediction changes between iterations.
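To make that concrete: with the default squared-error loss, the final prediction is just the initial (mean) prediction plus each tree’s correction scaled by the learning rate. A quick sketch to verify:

```python
# Reconstruct a prediction by hand (assumes the default squared-error loss)
block = X_test.iloc[[0]]  # any single block, kept 2D

pred = gb.init_.predict(block)  # the constant initial prediction (the mean of y_train)
for tree in gb.estimators_[:, 0]:
    pred = pred + gb.learning_rate * tree.predict(block)  # each tree's scaled correction

print(pred, gb.predict(block))  # these should match (up to floating point)
```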
There is a clear trade-off for learning rate. Comparing learning rates of 0.01 (Slow), 0.1 (Default), and 0.5 (Fast), over 100 iterations:
Faster learning rates can get to the correct value quicker, but they’re more likely to overcorrect and jump past the true value (think fishtailing in a car), and can lead to oscillations. Slow learning rates may never reach the correct value (think… not turning the steering wheel enough and driving straight into a tree). As for the stats:
Model | Default | Fast | Slow |
---|---|---|---|
Fit Time (s) | 2.159 | 2.288 | 2.166 |
Predict Time (s) | 0.005 | 0.004 | 0.015 |
MAE | 0.370 | 0.338 | 0.629 |
MAPE | 0.216 | 0.197 | 0.427 |
MSE | 0.289 | 0.247 | 0.661 |
RMSE | 0.538 | 0.497 | 0.813 |
R² | 0.779 | 0.811 | 0.495 |
Chosen Prediction | 0.803 | 0.949 | 1.44 |
Chosen Error | 0.091 | 0.055 | 0.546 |
Unsurprisingly, the slow learning model was terrible. Overall, fast was slightly better than the default. However, we can see on the plot how, at least for the chosen block, it was only the later iterations that got the fast model to be more accurate than the default one — if we’d stopped at 40 iterations, for the chosen block at least, the default model would have been far better. The joys of visualisation!
n_estimators
As mentioned above, the number of estimators goes hand in hand with the learning rate. In general, the more estimators the better, as it gives more iterations to measure and adjust for the error — although this comes at an additional time cost.
As seen above, a sufficiently high number of estimators is especially important for a low learning rate, to ensure the correct value is reached. Increasing the number of estimators to 500:
With enough iterations, the slow learning GBT did reach the true value. In fact, they all ended up much closer. The stats confirm this:
Model | DefaultMore | FastMore | SlowMore |
---|---|---|---|
Fit Time (s) | 12.254 | 12.489 | 11.918 |
Predict Time (s) | 0.018 | 0.014 | 0.022 |
MAE | 0.323 | 0.319 | 0.410 |
MAPE | 0.187 | 0.185 | 0.248 |
MSE | 0.232 | 0.228 | 0.338 |
RMSE | 0.482 | 0.477 | 0.581 |
R² | 0.823 | 0.826 | 0.742 |
Chosen Prediction | 0.841 | 0.921 | 0.858 |
Chosen Error | 0.053 | 0.027 | 0.036 |
Unsurprisingly, increasing the number of estimators five-fold increased the time to fit significantly (in this case by nearly six-fold, but that may just be a one-off). However, we still haven’t surpassed the scores of the constrained trees above — I guess we’ll need to do a hyperparameter search to see if we can beat them. Also, for the chosen block, as can be seen in the plot, after about 300 iterations none of the models really improved. If this is consistent across all the data, then the iterations beyond that point were unnecessary. I mentioned earlier how it’s possible to avoid wasting time iterating without improving; now’s the time to look into that.
n_iter_no_change, validation_fraction, and tol
It’s possible for additional iterations to not improve the final result, yet they still take time to run. This is where early stopping comes in.
There are three relevant hyperparameters. The first, n_iter_no_change, is how many consecutive iterations with “no change” there must be before stopping. tol[erance] is how small the improvement in the validation score has to be to count as “no change”. And validation_fraction is how much of the training data is held out as a validation set to generate that validation score (note this is separate from the test data).
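As a sketch, this is all it takes to switch early stopping on (the values match the comparison below):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Stop once 5 consecutive iterations improve the validation score by less than 0.005
gb_early = GradientBoostingRegressor(
    n_estimators=1000,
    n_iter_no_change=5,
    validation_fraction=0.1,
    tol=0.005,
    random_state=42,
)
gb_early.fit(X_train, y_train)

print(gb_early.n_estimators_)  # the number of trees actually built
```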
Comparing a 1000-estimator GBT with one using fairly aggressive early stopping — n_iter_no_change=5, validation_fraction=0.1, tol=0.005 — the latter stopped after only 61 estimators (and hence only took 5~6% of the time to fit):
As expected though, the results were worse:
Model | Default | Early Stopping |
---|---|---|
Fit Time (s) | 24.843 | 1.304 |
Predict Time (s) | 0.042 | 0.003 |
MAE | 0.313 | 0.396 |
MAPE | 0.181 | 0.236 |
MSE | 0.222 | 0.321 |
RMSE | 0.471 | 0.566 |
R² | 0.830 | 0.755 |
Chosen Prediction | 0.837 | 0.805 |
Chosen Error | 0.057 | 0.089 |
But as always, the question to ask: is it worth investing 20x the time to improve the R² by 10% and reduce the error by 20%?
Bayes searching
You were probably expecting this. The search spaces:
search_spaces = {
'learning_rate': (0.01, 0.5),
'max_depth': (1, 100),
'max_features': (0.1, 1.0, 'uniform'),
'max_leaf_nodes': (2, 20000),
'min_samples_leaf': (1, 100),
'min_samples_split': (2, 100),
'n_estimators': (50, 1000),
}
Most are similar to my previous posts; the only additional hyperparameter is learning_rate.
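The search itself was presumably wired up with skopt’s BayesSearchCV, as in the previous posts; a minimal sketch (n_iter, cv, and scoring here are my assumptions):

```python
from skopt import BayesSearchCV
from sklearn.ensemble import GradientBoostingRegressor

opt = BayesSearchCV(
    GradientBoostingRegressor(random_state=42),
    search_spaces,
    n_iter=100,                        # assumed; match whatever the previous posts used
    cv=5,                              # assumed
    scoring="neg_mean_squared_error",  # assumed
    n_jobs=-1,
    random_state=42,
)
opt.fit(X_train, y_train)

print(opt.best_params_)
```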
It took the longest so far, at 96 mins (~50% more than the random forest!). The best hyperparameters are:
best_parameters = OrderedDict({
'learning_rate': 0.04345459461297153,
'max_depth': 13,
'max_features': 0.4993693929975871,
'max_leaf_nodes': 20000,
'min_samples_leaf': 1,
'min_samples_split': 83,
'n_estimators': 325,
})
max_features, max_leaf_nodes, and min_samples_leaf are very similar to the tuned random forest. n_estimators is too, and it aligns with what the chosen block plot above suggested — iterations beyond about 300 were mostly unnecessary. However, compared with the tuned random forest, the trees are only a third as deep, and min_samples_split is far higher than we’ve seen so far. The value of learning_rate was not too surprising based on what we saw above.
And the cross-validated scores (the error metrics are negative because they come from sklearn’s neg_ scorers, which are maximised):
Metric | Mean | Std |
---|---|---|
MAE | -0.289 | 0.005 |
MAPE | -0.161 | 0.004 |
MSE | -0.200 | 0.008 |
RMSE | -0.448 | 0.009 |
R² | 0.849 | 0.006 |
Of all the models so far, this is the best, with smaller errors, higher R², and lower variances!
Finally, our old friend, the box plots:
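For completeness, a sketch of how a comparison box plot like this can be drawn, assuming the per-block errors for each fitted model have been collected into a dictionary (the names are mine):

```python
import matplotlib.pyplot as plt

# errors_by_model: {model name: per-block prediction errors on the test set}
errors_by_model = {name: model.predict(X_test) - y_test for name, model in models.items()}

names = list(errors_by_model)
plt.boxplot([errors_by_model[n] for n in names], vert=False)
plt.yticks(range(1, len(names) + 1), names)
plt.xlabel("Prediction error")
plt.show()
```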
Conclusion
And so we come to the end of my mini-series on the three most common types of tree-based models.
My hope is that, by seeing different ways of visualising trees, you now (a) better understand how the different models function, without having to look at equations, and (b) can use your own plots to tune your own models. It can also help with stakeholder management — execs prefer pretty pictures to tables of numbers, so showing them a tree plot can help them understand why what they’re asking you to do is impossible.
Based on this dataset, and these models, the gradient boosted one was slightly superior to the random forest, and both were far superior to a lone decision tree. However, this may partly be because the GBT’s hyperparameter search had around 50% more wall-clock time to find good hyperparameters (GBTs are typically more computationally expensive, even for the same number of search iterations). It’s also worth noting that GBTs have a higher tendency to overfit than random forests. And while the decision tree had worse performance, it is far faster — and in some use cases, this is more important. Additionally, as mentioned, there are other libraries, with pros and cons — for example, CatBoost handles categorical data out of the box, whereas other GBT libraries typically require categorical data to be preprocessed (e.g. one-hot or label encoding). Or, if you’re feeling really brave, how about stacking the different tree types in an ensemble for even better performance…
Anyway, until next time!