Assume we are building a model from scratch, based on a chosen architecture.
1. Define a performance metric P.
2. Define the cost function C.
3. Pick a parameter-initialization strategy W.
4. Pick a cost-function optimization algorithm A.
5. Choose a hyperparameter tuning strategy T.
6. Pick a combination H of hyperparameter values using the tuning strategy T.
7. Initialize the parameters of model M using strategy W.
8. Train M, using algorithm A, parametrized with hyperparameters H, to optimize cost function C.
9. If there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T, and repeat steps 7 and 8.
10. Return the model for which the metric P was optimized.
We choose P, C, W, A, and T depending on data characteristics, model complexity, and training dynamics.
| # | Step | Examples |
|---|------|----------|
| 1 | Performance metric (P) | Accuracy (classification), AUC-ROC (binary classification), F1-score (imbalanced data), MSE / RMSE (regression) |
| 2 | Cost/loss function (C) | Cross-entropy loss (classification), Mean Squared Error (MSE) (regression), Categorical cross-entropy (multiclass) |
| 3 | Parameter initialization strategy (W) | Xavier (Glorot) initialization (tanh activations), He initialization (ReLU activations), Random uniform / normal (can be unstable) |
| 4 | Optimization algorithm (A) | Stochastic Gradient Descent (SGD), Adam, RMSProp, Adagrad |
| 5 | Hyperparameter tuning strategy (T) | Grid Search, Random Search, Bayesian Optimization, Hyperband, Manual Tuning |
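To make the procedure concrete, here is a minimal, self-contained sketch (synthetic data, an arbitrary two-point hyperparameter grid, and PyTorch as the framework; none of these choices are prescribed by the notes) that maps each of P, C, W, A, and T onto code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary-classification data standing in for a real dataset.
X_train, y_train = torch.randn(800, 10), torch.randint(0, 2, (800,))
X_hold, y_hold = torch.randn(200, 10), torch.randint(0, 2, (200,))

def build_model(hidden):                          # chosen architecture, parametrized by H
    model = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 2))
    for m in model:                               # W: He initialization for ReLU layers
        if isinstance(m, nn.Linear):
            nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
            nn.init.zeros_(m.bias)
    return model

def accuracy(model, X, y):                        # P: performance metric on holdout data
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

best_model, best_acc = None, -1.0
for H in [{"lr": 1e-2, "hidden": 16}, {"lr": 1e-3, "hidden": 64}]:   # T: grid search
    model = build_model(H["hidden"])
    optimizer = torch.optim.Adam(model.parameters(), lr=H["lr"])     # A: optimizer
    loss_fn = nn.CrossEntropyLoss()                                  # C: cost function
    for _ in range(50):                           # training loop: minimize C
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    acc = accuracy(model, X_hold, y_hold)
    if acc > best_acc:                            # keep the model that optimizes P
        best_model, best_acc = model, acc
```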
# Performance Metric (P)
- Measures how good the model is, i.e. something we want to maximize.
- We define a metric that allows us to compare the performance of two models on the holdout data and select the better of the two.
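A small illustration with scikit-learn metrics (the toy labels are invented here) of why the choice of P matters: on imbalanced holdout data, two models with identical accuracy can differ sharply on F1-score.

```python
from sklearn.metrics import accuracy_score, f1_score

y_hold  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # toy imbalanced holdout labels
preds_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # model A: always predicts the majority class
preds_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # model B: finds the minority class, one false alarm

print(accuracy_score(y_hold, preds_a), f1_score(y_hold, preds_a))  # 0.9, 0.0 (no positives predicted)
print(accuracy_score(y_hold, preds_b), f1_score(y_hold, preds_b))  # 0.9, ~0.667
```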
# Cost Function (C)
- Measures how bad the model is during training, i.e. something we want to minimize.
- Defines what our learning algorithm optimizes while training a model.
- Regression: MSE
- Classification:
- Categorical Cross-Entropy for multiclass classification
- Binary Cross-Entropy for binary and multi-label classification.
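In PyTorch these choices map onto built-in loss modules; a minimal sketch with toy tensors (note that `nn.CrossEntropyLoss` expects raw logits and integer class targets, and `nn.BCEWithLogitsLoss` expects raw logits and float targets):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()            # regression
ce  = nn.CrossEntropyLoss()   # multiclass: logits of shape (N, classes), integer targets
bce = nn.BCEWithLogitsLoss()  # binary / multi-label: logits and float targets

print(mse(torch.tensor([2.5, 0.0]), torch.tensor([3.0, 0.0])))    # (0.5^2 + 0) / 2 = 0.125
print(ce(torch.tensor([[2.0, 0.5, 0.1]]), torch.tensor([0])))     # one sample, true class 0
print(bce(torch.tensor([0.3, -1.2]), torch.tensor([1.0, 0.0])))   # two independent binary labels
```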
# Parameter Initialization Strategy (W)
- Decides how the model starts learning. Influences learning speed and stability.
# Optimization Algorithm (A)
- Determines how weights get updated after computing loss gradients.
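A minimal sketch of one weight update in PyTorch; Adam is shown, but any of the optimizers listed above can be swapped in from `torch.optim` (the model, data, and sizes here are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # or torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 4), torch.randn(8, 1)
optimizer.zero_grad()         # clear gradients from the previous step
loss = loss_fn(model(x), y)   # compute the loss
loss.backward()               # compute loss gradients w.r.t. the weights
optimizer.step()              # update the weights using the gradients
```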
# Hyperparameter Tuning Strategy (T)
- Finds the best hyperparameters, e.g. learning rate, batch size, architecture depth.
- Neural networks start with unknown weights/bias, updated by training.
- Initialization gives a starting point for optimization (gradient descent).
- Good initialization: faster convergence, stable gradients (avoiding vanishing/exploding) and better model performance.
- Bad initialization:
- All weights $=0$ (or any identical constant, e.g. 1): no learning, because gradients collapse or all units receive identical updates.
- Too large weights: exploding gradients; too small weights: vanishing gradients.
- Common initialization strategies: random normal, random uniform, Xavier, He.
- PyTorch (pick one weight-initialization call, matching the activation function):
    import torch.nn as nn
    layer = nn.Linear(128, 64)
    nn.init.xavier_uniform_(layer.weight)  # Xavier/Glorot, for tanh activations
    # or: nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')  # He, for ReLU activations
    nn.init.zeros_(layer.bias)  # Bias often initialized to zero
- Training and validation cycle:
- If performance does not improve, we can pick a different combination of hyperparameters and build a different model.
- We will continue to test different values of hyperparameters until there are no more values to test.
- Then we keep the best model among those we trained in the process.
- If the performance of the best model is still not satisfactory, we try a different network architecture, add more labeled data, or try transfer learning.
- Hyperparameter space:
- Hyperparameter values strongly affect the properties of a trained NN.
- Hyperparameter tuning is time and resource intensive.
- We must decide which hyperparameters are important enough to spend the time on.
- While there is no definitive answer, during training we can:
  - Start with default values and then change them (e.g., see the ‘cards’ in Lecture 16 + PyTorch)
  - Observe to which hyperparameters the model is more sensitive
  - Prefer tuning hyperparameters over simply keeping their default values where it pays off
  - Focus tuning on the hyperparameters to which the model is sensitive
- The table below shows the approximate sensitivity of a model to some hyperparameters.
| Hyperparameter | Sensitivity |
|---|---|
| Learning rate | High |
| Learning rate schedule | High |
| Loss function | High |
| Units per layer | High |
| Parameter initialization strategy | Medium |
| Number of layers | Medium |
| Layer properties | Medium |
| Degree of regularization | Medium |
| Choice of optimizer | Low |
| Optimizer properties | Low |
| Size of minibatch | Low |
| Choice of non-linearity | Low |
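One way to act on this table is to spend the search budget on the high-sensitivity hyperparameters (e.g. learning rate on a log scale, units per layer) and leave low-sensitivity ones at their defaults. A minimal random-search sketch (the ranges and budget below are arbitrary):

```python
import random

random.seed(0)

def sample_config():
    return {
        "lr": 10 ** random.uniform(-4, -1),                      # high sensitivity: search on a log scale
        "units_per_layer": random.choice([32, 64, 128, 256]),    # high sensitivity
        "batch_size": 32,                                        # low sensitivity: keep the default
        "optimizer": "adam",                                     # low sensitivity: keep the default
    }

trials = [sample_config() for _ in range(20)]                    # budget of 20 random configurations
print(trials[0])
```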
# Common Configurations
## Classification
- Performance Metric (P)
- Accuracy for a balanced dataset
- F1-score for an imbalanced dataset
- Cost/loss function (C)
- Cross-entropy for multiclass or multilabel classification
- Parameter Initialization Strategy (W)
- Xavier when using tanh as the activation function.
- He when using ReLU as the activation function
- Optimization Algorithm (A)
- Adam: adaptive learning rates; handles sparse gradients well
- Stochastic gradient descent (SGD): strong generalization and stability for large datasets
- Hyperparameter tuning strategy (T)
- Random Search over learning rate, dropout rate, embedding dimension, layers
- Bayesian Optimization
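A sketch of wiring this classification configuration together in PyTorch (layer sizes and learning rate are arbitrary placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
for m in model:
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')   # W: He init for ReLU
        nn.init.zeros_(m.bias)

loss_fn = nn.CrossEntropyLoss()                                    # C: cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # A: Adam
# T: e.g. random search over lr, dropout rate, embedding dimension, number of layers
```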
## Regression
- Performance Metric (P)
- Root Mean Squared Error (RMSE) when the data has no large errors or outliers
- Mean Absolute Error (MAE) when the data has outliers and large errors
- Cost/loss function (C)
- Mean Squared Error (MSE): penalizes large errors, i.e. very sensitive to outliers
- Huber Loss: combines MSE and MAE to lower sensitivity to outliers
- Parameter Initialization Strategy (W)
- He initialization when using ReLU as the activation function
- Optimization Algorithm (A)
- RMSProp: suited to time series, RNNs, and sequential data, but needs tuning
- Adam: general-purpose default choice for tabular data, images, and text; robust and stable
- Hyperparameter tuning strategy (T)
- Grid Search
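A corresponding regression sketch (sizes are arbitrary; `nn.HuberLoss` is available in recent PyTorch releases, with `nn.SmoothL1Loss` as an older near-equivalent):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
for m in model:
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')    # W: He init for ReLU

loss_fn = nn.HuberLoss()                                          # C: Huber, less outlier-sensitive than nn.MSELoss
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)      # A: RMSProp (Adam is a robust default too)
grid = {"lr": [1e-2, 1e-3, 1e-4], "hidden": [32, 64]}             # T: grid-search space
```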
## Imbalanced Classification
- Performance Metric (P)
- F1
- AUC
- Cost/loss function (C)
- Weighted cross-entropy
- Parameter Initialization Strategy (W)
- Xavier
- Optimization Algorithm (A)
- AdamW
- Hyperparameter tuning strategy (T)
- Bayesian Optimization
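Weighted cross-entropy can be expressed by passing per-class weights to `nn.CrossEntropyLoss`; the 900/100 class counts and inverse-frequency weighting below are an illustrative heuristic, not part of the notes:

```python
import torch
import torch.nn as nn

# Toy class counts: 900 negatives, 100 positives -> weight the minority class more heavily.
class_counts = torch.tensor([900.0, 100.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)   # inverse-frequency weights

model = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 2))
for m in model:
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)                 # W: Xavier for tanh

loss_fn = nn.CrossEntropyLoss(weight=class_weights)       # C: weighted cross-entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)   # A: AdamW
```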
## Time Series Forecasting
- Performance Metric (P)
- RMSE
- MAE
- Cost/loss function (C)
- MSE
- Parameter Initialization Strategy (W)
- He
- Optimization Algorithm (A)
- RMSProp
- Hyperparameter tuning strategy (T)
- Grid Search
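A minimal forecasting sketch under these choices (a single-layer LSTM predicting the next value of a toy batch of sequences; all sizes are arbitrary):

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    def __init__(self, n_features=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])            # predict the next value from the last hidden state

model = Forecaster()
loss_fn = nn.MSELoss()                                          # C: MSE
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)    # A: RMSProp

x, y = torch.randn(16, 24, 1), torch.randn(16, 1)               # 16 toy sequences of 24 steps
loss = loss_fn(model(x), y)
```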
## Image Tasks (CNNs)
- Performance Metric (P)
- Accuracy
- Cost/loss function (C)
- Cross-Entropy
- Parameter Initialization Strategy (W)
- He
- Optimization Algorithm (A)
- SGD + Momentum
- Hyperparameter tuning strategy (T)
- Hyperband
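And a CNN sketch under this configuration (a tiny convolutional classifier sized for 28x28 grayscale inputs; Hyperband itself would come from a tuning library such as Ray Tune or Keras Tuner):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
)
for m in model:
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')         # W: He init

loss_fn = nn.CrossEntropyLoss()                                        # C: cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9) # A: SGD + momentum
```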