How to leverage Bayesian Optimization to tune the hyperparameters of deep learning models (a Keras Sequential model), in comparison with a traditional approach, Grid Search.
Bayesian Optimization
Bayesian Optimization is a sequential design strategy for global optimization of black-box functions.
It is particularly well-suited for functions that are expensive to evaluate, lack an analytical form, or have unknown derivatives.
In the context of hyperparameter optimization, the unknown function can be:
- an objective function,
- accuracy value for a training or validation set,
- loss value for a training or validation set,
- entropy gained or lost,
- AUC for ROC curves,
- A/B test results,
- computation cost per epoch,
- model size,
- reward amount for reinforcement learning, and more.
Unlike traditional optimization methods that rely on direct function evaluations, Bayesian Optimization builds and refines a probabilistic model of the objective function, using this model to intelligently select the next evaluation point.
The core idea revolves around two key components:
1. Surrogate Model (Probabilistic Model)
The unknown objective function f(x) is approximated by a surrogate model, such as a Gaussian Process (GP).
A GP is a non-parametric Bayesian model that defines a distribution over functions. It provides:
- a prediction of the function value at a given point μ(x) and
- a measure of uncertainty around that prediction σ(x), often represented as a confidence interval.
Mathematically, for a Gaussian Process, the prediction at an unobserved point x∗, given observed data (X, y), is normally distributed:

f(x∗) | X, y ~ N(μ(x∗), σ²(x∗))

where
- μ(x∗): the mean prediction and
- σ²(x∗): the predictive variance.
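To make the surrogate concrete, here is a minimal sketch (not part of the article's tuning code) that fits a GP on a toy 1-D objective with scikit-learn and queries μ(x∗) and σ(x∗) at unobserved points; the objective function and kernel choice are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# toy 1-D objective standing in for an expensive black-box function
def objective(x):
    return np.sin(3 * x) + 0.5 * x

# a handful of observed points (X, y)
X_obs = np.array([[0.2], [1.0], [2.5], [4.0]])
y_obs = objective(X_obs).ravel()

# fit the GP surrogate on the observations
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# predictive mean mu(x*) and uncertainty sigma(x*) at unobserved points
X_new = np.linspace(0, 5, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_new, return_std=True)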
2. Acquisition Function
The acquisition function determines the next point x_(t+1) to evaluate by quantifying how "promising" each candidate point is for improving the objective function, balancing:
- Exploration (High Variance): Sampling in areas with high uncertainty to discover new promising regions and
- Exploitation (High Mean): Sampling in areas where the surrogate model predicts high objective values.
Common acquisition functions include:
Probability of Improvement (PI)
PI selects the point that has the highest probability of improving upon the current best observed value f(x+):

PI(x) = P(f(x) ≥ f(x+) + ξ) = Φ((μ(x) - f(x+) - ξ) / σ(x))

where
- Φ: the cumulative distribution function (CDF) of the standard normal distribution, and
- ξ ≥ 0: a trade-off parameter between exploration and exploitation; a larger ξ encourages more exploration.
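As a small illustrative sketch (not KerasTuner's internals), PI can be computed directly from the surrogate's μ(x) and σ(x) with SciPy; mu, sigma, and f_best are assumed to come from a fitted surrogate such as the GP above.

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x)) for a maximization problem."""
    sigma = np.maximum(sigma, 1e-9)   # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return norm.cdf(z)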
Expected Improvement (EI)
EI quantifies the expected amount of improvement over the current best observed value f(x+):

EI(x) = E[max(f(x) - f(x+), 0)]

Assuming a Gaussian Process surrogate, the analytical form of EI is:

EI(x) = (μ(x) - f(x+)) Φ(Z) + σ(x) ϕ(Z), with Z = (μ(x) - f(x+)) / σ(x), and EI(x) = 0 when σ(x) = 0

where ϕ is the probability density function (PDF) of the standard normal distribution.
EI is one of the most widely used acquisition functions because, unlike PI, it also accounts for the magnitude of the improvement.
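A corresponding sketch of the analytical EI under a GP surrogate (again an illustration built on mu, sigma, and f_best from a fitted surrogate, with an optional exploration offset xi):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """EI(x) = (mu - f_best - xi) * Phi(Z) + sigma * phi(Z), and 0 where sigma = 0."""
    imp = mu - f_best - xi
    z = imp / np.maximum(sigma, 1e-9)   # guard against division by zero
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)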
Upper Confidence Bound (UCB)
UCB balances exploitation (high mean) and exploration (high variance), favoring points that have both a high predicted mean and high uncertainty:

UCB(x) = μ(x) + κ σ(x)

where κ ≥ 0 is a tuning parameter that controls the balance between exploration and exploitation; a larger κ puts more emphasis on exploring uncertain regions.
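And the UCB sketch, the simplest of the three (mu and sigma are again NumPy arrays from the fitted surrogate):

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB(x) = mu(x) + kappa * sigma(x); a larger kappa favors exploration."""
    return mu + kappa * sigma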
Bayesian Optimization Strategy (Iterative Process)
Bayesian Optimization iteratively updates the surrogate model and optimizes the acquisition function.
It guides the search towards optimal regions while minimizing the number of expensive objective function evaluations.
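Before turning to KerasTuner, here is a minimal end-to-end sketch of this generic loop on the toy 1-D problem, reusing objective() and expected_improvement() from the earlier snippets; it is purely illustrative, not KerasTuner's implementation. The numbered comments correspond to the steps described below.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# objective() and expected_improvement() are defined in the earlier sketches
rng = np.random.default_rng(42)

# Step 1: a few random initial observations
X_obs = rng.uniform(0, 5, size=(5, 1))
y_obs = objective(X_obs).ravel()

for _ in range(20):  # Step 6: iterate until the budget is exhausted
    # Step 2: (re)fit the surrogate on all observations so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)

    # Step 3: maximize the acquisition function over random candidates
    X_cand = rng.uniform(0, 5, size=(1000, 1))
    mu, sigma = gp.predict(X_cand, return_std=True)
    x_next = X_cand[np.argmax(expected_improvement(mu, sigma, y_obs.max()))]

    # Step 4: evaluate the expensive objective at the chosen point
    y_next = objective(x_next)

    # Step 5: add the new observation and repeat
    X_obs = np.vstack([X_obs, x_next.reshape(1, -1)])
    y_obs = np.append(y_obs, y_next)

print(X_obs[np.argmax(y_obs)], y_obs.max())  # best point found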
Now, let us walk through the process with code snippets using KerasTuner for a fraud detection task (binary classification where misclassifying y = 1 (fraud) costs us the most).
Step 1. Initialization
Initialize the process by sampling the hyperparameter space randomly or with a low-discrepancy sequence (usually picking 5 to 10 points) to get an idea of the objective function.
These initial observations are used to build the first version of the surrogate model.
As we build a Keras Sequential model, we first define and compile the model, then define the BayesianOptimization tuner with the number of initial points to assess.
import keras_tuner as kt
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
# initialize a Keras Sequential model
# (this snippet lives inside the custom HyperModel's build(self, hp) method,
#  which is why self and hp are available here)
model = Sequential([
    Input(shape=(self.input_shape,)),
    Dense(
        units=hp.Int('neurons1', min_value=20, max_value=60, step=10),
        activation='relu'
    ),
    Dropout(
        hp.Float('dropout_rate1', min_value=0.0, max_value=0.5, step=0.1)
    ),
    Dense(
        units=hp.Int('neurons2', min_value=20, max_value=60, step=10),
        activation='relu'
    ),
    Dropout(
        hp.Float('dropout_rate2', min_value=0.0, max_value=0.5, step=0.1)
    ),
    Dense(
        1, activation='sigmoid',
        bias_initializer=keras.initializers.Constant(self.initial_bias_value)
    )
])

# compile the model
# (optimizer is built from tuned hyperparameters elsewhere in build())
model.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)
# define a tuner with the initial points
tuner = kt.BayesianOptimization(
    hypermodel=custom_hypermodel,
    objective=kt.Objective("val_recall", direction="max"),
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    directory=directory,
    project_name=project_name,
    num_initial_points=num_initial_points,
    overwrite=True,
)
num_initial_points defines how many initial, randomly selected hyperparameter configurations are evaluated before the algorithm starts to guide the search.
If not given, KerasTuner uses a default of 3 times the dimensionality of the hyperparameter space.
Step 2. Surrogate Model Training
Build and train the probabilistic model (the surrogate model, often a Gaussian Process or a Tree-structured Parzen Estimator) using all available observed data points (input values and their corresponding output values) to approximate the true function.
The surrogate model provides a mean prediction μ(x) and an uncertainty estimate σ(x) for any unobserved point.
KerasTuner uses an internal surrogate model to model the relationship between hyperparameters and the objective function’s performance.
After each objective function evaluation via train run, the observed data points (hyperparameters and validation metrics) are used to update the internal surrogate model.
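Conceptually, each finished trial contributes a (hyperparameter configuration, score) pair, and the surrogate is refit on the growing set of observations. The sketch below illustrates that idea with a scikit-learn GP; the numeric encoding of the hyperparameters and the example scores are made up for demonstration and are not KerasTuner's internal code.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# observed trials, with hyperparameters encoded as numeric vectors:
# [neurons1, dropout_rate1, neurons2, dropout_rate2, log10(learning_rate)]
X_trials = np.array([
    [40, 0.1, 20, 0.3, -3.0],
    [60, 0.0, 40, 0.1, -2.5],
    [20, 0.2, 30, 0.4, -3.5],
])
y_trials = np.array([0.71, 0.78, 0.69])  # made-up val_recall values per trial

# refit the surrogate after every new observation
surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
surrogate.fit(X_trials, y_trials)

# the surrogate now predicts mean and uncertainty for any candidate configuration
mu, sigma = surrogate.predict(np.array([[50, 0.1, 30, 0.2, -2.7]]), return_std=True)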
Step 3. Acquisition Function Optimization
Use an optimization algorithm (often a cheap, local optimizer like L-BFGS or even random search) to find the next point x_(t+1) that maximizes the chosen acquisition function.
This step is crucial because it identifies the most promising next candidate for evaluation by balancing exploration (trying new, uncertain areas of the hyperparameter space) and exploitation (refining promising areas).
KerasTuner uses an acquisition function such as Expected Improvement or Upper Confidence Bound to select the next set of hyperparameters.
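A hedged sketch of this step, using random-search maximization of the acquisition function over candidate configurations (reusing surrogate, y_trials, and expected_improvement from the sketches above; illustrative only):

import numpy as np

# surrogate, y_trials, and expected_improvement come from the earlier sketches
rng = np.random.default_rng(0)

# sample candidate hyperparameter vectors within their allowed ranges
candidates = np.column_stack([
    rng.integers(20, 61, size=2000),     # neurons1
    rng.uniform(0.0, 0.5, size=2000),    # dropout_rate1
    rng.integers(20, 61, size=2000),     # neurons2
    rng.uniform(0.0, 0.5, size=2000),    # dropout_rate2
    rng.uniform(-4.0, -2.0, size=2000),  # log10(learning_rate)
])

# score every candidate with the acquisition function and pick the maximizer
mu, sigma = surrogate.predict(candidates, return_std=True)
scores = expected_improvement(mu, sigma, f_best=y_trials.max())
x_next = candidates[np.argmax(scores)]   # next configuration to train and evaluate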
Step 4. Objective Function Evaluation
Evaluate the true, expensive objective function (f(x)) at the new candidate point (x_t+1).
The Keras model is trained on the provided training data and evaluated on the validation data. We use val_recall as the objective value returned by this evaluation.
def fit(self, hp, model=None, *args, **kwargs):
    # build the model from the current hyperparameters if one is not passed in
    model = self.build(hp=hp) if not model else model
    # training-time hyperparameters are tuned here as well
    batch_size = hp.Choice('batch_size', values=[16, 32, 64])
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)
    return model.fit(
        *args,
        batch_size=batch_size,
        epochs=epochs,
        class_weight=self.class_weights_dict,
        **kwargs
    )
Step 5. Data Update
Add the newly observed data point (x_(t+1), f(x_(t+1))) to the set of observations.
Step 6. Iteration
Repeat Steps 2 to 5 until a stopping criterion is met.
Technically, the tuner.search() method orchestrates the entire Bayesian Optimization process from Step 2 to Step 5:
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback]
)
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
best_keras_model_from_tuner = tuner.get_best_models(num_models=1)[0]
The method repeatedly performs these steps until the max_trials limit is reached or other stopping criteria, such as the early_stopping_callback, are met.
Here, we set recall as our key metric because false negatives (missed fraud cases) cost us the most in the fraud detection case.
Learn More: KerasTuner Source Code
Results
The Bayesian Optimization process aimed to enhance the model’s performance, primarily by maximizing recall.
The tuning efforts yielded a trade-off across key metrics, resulting in a model with significantly improved recall at the expense of some precision and overall accuracy compared to the initial state:
- Recall: 0.9055 (initial: 0.6595 -> 0.6450); test: 0.8400
- Precision: 0.6831 (initial: 0.8338 -> 0.8113); test: 0.6747
- Accuracy: 0.7427 (initial: 0.7640 -> 0.7475); test: 0.7175
(The first value is from the development phase (training and validation combined), the values in parentheses are the initial scores, and the last value is from the test phase.)
Best performing hyperparameter set:
- neurons1: 40
- dropout_rate1: 0.0
- neurons2: 20
- dropout_rate2: 0.4
- optimizer_name: lion
- learning_rate: 0.004019639999963362
- batch_size: 64
- epochs: 200
- beta_1_lion: 0.9
- beta_2_lion: 0.99
Optimal Neural Network Summary:
Key Performance Metrics:
- Recall: The model demonstrated a significant improvement in recall, increasing from an initial value of approximately 0.66 (or 0.645) to 0.8400. This indicates the optimized model is notably better at identifying positive cases.
- Precision: Concurrently, precision experienced a decrease. Starting from around 0.83 (or 0.81), it settled at 0.6747 post-optimization. This suggests that while more positive cases are being identified, a higher proportion of those identifications might be false positives.
- Accuracy: The overall accuracy of the model also saw a decline, moving from an initial 0.7640 (or 0.7475) down to 0.7175. This is consistent with the observed trade-off between recall and precision, where optimizing for one often impacts the others.
Comparing with Grid Search
For comparison, we tuned a Keras Sequential model with Grid Search using the Adam optimizer:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier
import numpy as np
from sklearn.utils import class_weight
# hyperparameter grid to search exhaustively
param_grid = {
    'model__learning_rate': [0.001, 0.0005, 0.0001],
    'model__neurons1': [20, 30, 40],
    'model__neurons2': [20, 30, 40],
    'model__dropout_rate1': [0.1, 0.15, 0.2],
    'model__dropout_rate2': [0.1, 0.15, 0.2],
    'batch_size': [16, 32, 64],
    'epochs': [50, 100],
}

input_shape = X_train.shape[1]
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

# create_model is the model-building function defined elsewhere
keras_classifier = KerasClassifier(
    model=create_model,
    model__input_shape=input_shape,
    model__initial_bias_value=initial_bias,
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)

grid_search = GridSearchCV(
    estimator=keras_classifier,
    param_grid=param_grid,
    scoring='recall',
    cv=3,
    n_jobs=-1,
    error_score='raise'
)

# early_stopping_callback is the same callback used in the KerasTuner run
grid_result = grid_search.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict
)

optimal_params = grid_result.best_params_
best_keras_classifier = grid_result.best_estimator_
optimal_params = grid_result.best_params_
best_keras_classifier = grid_result.best_estimator_
Results
Grid Search tuning resulted in a model with strong precision and good overall accuracy, though with a lower recall compared to the Bayesian Optimization approach:
- Recall: 0.8214 (initial: 0.7735 -> 0.7150); test: 0.7100
- Precision: 0.7884 (initial: 0.8331 -> 0.8034); test: 0.8304
- Accuracy: 0.8005 (initial: 0.8092 -> 0.7700); test: 0.7825
Best performing hyperparameter set:
- neurons1: 40
- dropout_rate1: 0.15
- neurons2: 40
- dropout_rate2: 0.1
- learning_rate: 0.001
- batch_size: 16
- epochs: 100
Optimal Neural Network Summary:
Grid Search Performance:
- Recall: Achieved a recall of 0.7100, a slight decrease from its initial range (0.7735–0.7150).
- Precision: Showed robust performance at 0.8304, an improvement over its initial range (0.8331–0.8034).
- Accuracy: Settled at 0.7825, maintaining a solid overall predictive capability, slightly lower than its initial range (0.8092–0.7700).
Comparison with Bayesian Optimization:
- Recall: Bayesian Optimization (0.8400) significantly outperformed Grid Search (0.7100) in identifying positive cases.
- Precision: Grid Search (0.8304) achieved much higher precision than Bayesian Optimization (0.6747), indicating fewer false positives.
- Accuracy: Grid Search’s accuracy (0.7825) was notably higher than Bayesian Optimization’s (0.7175).
General Comparison with Grid Search
1. Approaching the Search Space
Bayesian Optimization
- Intelligent/Adaptive: Bayesian Optimization builds a probabilistic model (often a Gaussian Process) of the objective function (e.g., model performance as a function of hyperparameters). It uses this model to predict which hyperparameter combinations are most likely to yield better results.
- Informed: It learns from previous evaluations. After each trial, the probabilistic model is updated, guiding the search towards more promising regions of the hyperparameter space. This allows it to make “intelligent” choices about where to sample next, balancing exploration (trying new, unknown regions) and exploitation (focusing on regions that have shown good results).
- Sequential: It typically operates sequentially, evaluating one point at a time and updating its model before selecting the next.
Grid Search:
- Exhaustive/Brute-force: Grid Search systematically tries every possible combination of hyperparameter values from a pre-defined set of values for each hyperparameter. You specify a “grid” of values, and it evaluates every point on that grid.
- Uninformed: It doesn’t use the results of previous evaluations to inform the selection of the next set of hyperparameters to try. Each combination is evaluated independently.
- Deterministic: Given the same grid, it will always explore the same combinations in the same order.
2. Computational Cost
Bayesian Optimization
- More Efficient: Designed to find optimal hyperparameters with significantly fewer evaluations compared to Grid Search. This makes it particularly effective when evaluating the objective function (e.g., training a Machine Learning model) is computationally expensive or time-consuming.
- Scalability: Generally scales better to higher-dimensional hyperparameter spaces than Grid Search, though it can still be computationally intensive for very high dimensions due to the overhead of maintaining and updating the probabilistic model.
Grid Search
- Computationally Expensive: As the number of hyperparameters and the range of values for each hyperparameter increase, the number of combinations grows exponentially. This leads to very long run times and high computational cost, making it impractical for large search spaces. This is often referred to as the "curse of dimensionality." The quick count after this list makes this concrete for the param_grid used above.
- Scalability: Does not scale well with high-dimensional hyperparameter spaces.
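To make the combinatorial growth concrete: the param_grid defined earlier already implies 3 x 3 x 3 x 3 x 3 x 3 x 2 = 1,458 combinations, and with cv=3 each combination is trained three times, i.e. 4,374 model fits. A quick way to count (a small helper, not part of the tuning code):

from math import prod

# the Grid Search param_grid defined earlier
param_grid = {
    'model__learning_rate': [0.001, 0.0005, 0.0001],
    'model__neurons1': [20, 30, 40],
    'model__neurons2': [20, 30, 40],
    'model__dropout_rate1': [0.1, 0.15, 0.2],
    'model__dropout_rate2': [0.1, 0.15, 0.2],
    'batch_size': [16, 32, 64],
    'epochs': [50, 100],
}

n_combinations = prod(len(values) for values in param_grid.values())  # 1458
n_fits = n_combinations * 3  # cv=3 -> 4374 model trainings
print(n_combinations, n_fits)

Bayesian Optimization, by contrast, is capped at max_trials evaluations regardless of how finely the continuous ranges could be sampled.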
3. Guarantees and Exploration
Bayesian Optimization
- Probabilistic guarantee: It aims to find the global optimum efficiently, but it does not offer a hard guarantee like Grid Search for finding the absolute best within a discrete set. Instead, it converges probabilistically towards the optimum.
- Smarter exploration: Its balance of exploration and exploitation helps it avoid getting stuck in local optima and discover optimal values more effectively.
Grid Search
- Guaranteed to find best in grid: If the optimal hyperparameters are within the defined grid, Grid Search is guaranteed to find them because it tries every combination.
- Limited exploration: It can miss optimal values if they fall between the discrete points defined in the grid.
4. When to Use Which
Bayesian Optimization:
- Large, high-dimensional hyperparameter spaces: When evaluating models is expensive and you have many hyperparameters to tune.
- When efficiency is paramount: To find good hyperparameters quickly, especially in situations with limited computational resources or time.
- Black-box optimization problems: When the objective function is complex, non-linear, and doesn’t have a known analytical form.
Grid Search
- Small, low-dimensional hyperparameter spaces: When you have only a few hyperparameters and a limited number of values for each, Grid Search can be a simple and effective choice.
- When exhaustiveness is critical: If you absolutely need to explore every single defined combination.
Conclusion
The experiment effectively demonstrated the distinct strengths of Bayesian Optimization and Grid Search in hyperparameter tuning.
Bayesian Optimization, by design, proved highly effective at intelligently navigating the search space and prioritizing a specific objective, in this case, maximizing recall.
It successfully achieved a higher recall rate (0.8400) compared to Grid Search, indicating its ability to find more positive instances.
This capability comes with an inherent trade-off, leading to reduced precision and overall accuracy.
Such an outcome is highly valuable in applications where minimizing false negatives is critical (e.g., medical diagnosis, fraud detection).
Its efficiency, stemming from probabilistic modeling that guides the search towards promising areas, makes it a preferred method for optimizing costly experiments or simulations where each evaluation is expensive.
In contrast, Grid Search, while exhaustive, yielded a more balanced model with superior precision (0.8304) and overall accuracy (0.7825).
This suggests Grid Search was more conservative in its predictions, resulting in fewer false positives.
In summary, while Grid Search offers a straightforward and exhaustive approach, Bayesian Optimization stands out as a more sophisticated and efficient method capable of finding superior results with fewer evaluations, particularly when optimizing for a specific, often complex, objective like maximizing recall in a high-dimensional space.
The optimal choice of tuning method ultimately depends on the specific performance priorities and resource constraints of the application.
Author: Kuriko IWAI
Portfolio / LinkedIn / Github
May 26, 2025
All images, unless otherwise noted, are by the author.
The article utilizes synthetic data, licensed under Apache 2.0 for commercial use.