Feedforward Neural Networks

  • Simplest type of artificial neural network, single/multi-layer perceptron
  • Use: classification and regression tasks for structured data and simple features.
  • Data flow: one direction (input → hidden layers → output), no loops or cycles.
  • Structure: input layer, N hidden layers, output layer.
  • Activation function (a short code sketch follows this list):
    • Rectified linear unit (ReLU) for hidden layers
    • Sigmoid for binary classification
    • Softmax for multi-class classification
  • Loss function:
    • MSE
    • cross-entropy loss
  • Learning: gradient descent + the chain rule (backpropagation) to compute and apply weight updates
  • Pros: simple to implement.
  • Cons: inefficient for image data, needs large datasets for deep architectures, can suffer from vanishing gradients without careful design.
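A minimal NumPy sketch of these three activation functions (the toy input values are illustrative, not from the notes):

import numpy as np

def relu(z):
    # Rectified linear unit: keeps positive values, zeroes out negatives
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real value into (0, 1); used for binary classification outputs
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Normalizes a vector of scores into a probability distribution over classes
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.5, 2.0])
print(relu(z))     # [0.  0.5 2. ]
print(sigmoid(z))  # each value in (0, 1)
print(softmax(z))  # sums to 1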

#Execution Model

  • Features are encoded, and each feature is mapped to an input neuron ($X_n$)
  • Each input neuron feeds every neuron of the following layer through a weight ($W_n$)
  • Each receiving neuron performs two operations:
    • Weighted sum: $Z=WX+b$
    • Activation function: e.g. $f(z)=\max(0,z)$
  • This repeats for each neuron of each layer until the output layer outputs prediction $\hat{Y}$.
  • Usually, hidden and output layers have different activation functions.
  • After the model outputs $\hat{Y}$, we compare it to the true value $Y$ using a Loss Function:
    • Binary Classification: Binary cross-entropy
    • Multi-class classification: categorical cross-entropy
    • Regression: mean squared error (MSE)
  • We adjust the weights via backpropagation and gradient descent (a minimal sketch of one such step follows this list):
  • Compute how much each weight contributed to the error.
  • Adjust the weights to reduce the error.
  • Repeat for many iterations (epochs) until the model improves.
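A minimal NumPy sketch of one training step for a single sigmoid output neuron with binary cross-entropy; the input, weights, bias, and learning rate are made-up values used only to illustrate the forward pass, loss, and gradient update described above:

import numpy as np

X = np.array([1.0, 0.0, 1.0])        # encoded input features (illustrative)
W = np.array([0.5, -0.3, 0.8])       # weights (illustrative values)
b = 0.1                              # bias
y = 1.0                              # true label
lr = 0.1                             # learning rate

# Forward pass: weighted sum, then sigmoid activation (binary classification)
z = W @ X + b
y_hat = 1 / (1 + np.exp(-z))

# Binary cross-entropy loss
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Backpropagation: d(loss)/dz simplifies to (y_hat - y) for sigmoid + BCE
dz = y_hat - y
W -= lr * dz * X                     # gradient descent step on the weights
b -= lr * dz                         # and on the bias
print(f"prediction={y_hat:.3f}, loss={loss:.3f}")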

#Examples

  • Look at how an FNN processes textual input.
  • Sentiment Analysis: classify the sentiment of a sentence as positive or negative
  • Text Classification: Categorize sentences into topics (e.g. ML vs Science)
  • How is input passed to the Neural Network?
  • Encode the text, e.g. with One-Hot Encoding
  • Map each encoded feature to one input neuron, so the input layer is as large as the feature set
  • Dataset
  • “I love machine learning”
  • “Deep learning is very difficult”
  • “Learning architectures is useful”
  • Bag of Words encoder: 3 records, each with 10 features
  • Now that we have encoded the records, we need to map them to the neurons of the input layer to start to train the Neural Network.
  • We use the hyperparameter batch size to decide how many records to pass to the NN at each training iteration.

Batch Size = 1

  • We pass a single input record, e.g., $X = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]$.
  • We choose a weight initialization method (another hyperparameter) and get a weight vector, e.g., $W = [-0.26, -0.23, -0.25, -0.41, -1.16, 0.42, 0.36, -0.68, -0.19, -0.33]$.
  • We choose a bias initialization method (yep, another hyperparameter) and get a bias vector, one bias value for each neuron of a layer.
  • We compute the weighted sum ($Z = WX + b$); a short NumPy check follows the bias note below:
$$ ((-0.26*1) + (-0.23*0) + (-0.25*0) + (-0.41*0) + (-1.16*1) + (0.42*1) + (0.36*1) + (-0.68*0) + (-0.19*0) + (-0.33*0)) + b $$

Why Bias? With the bias, the decision boundaries can be anywhere in the input space, not just through the origin. More flexible.
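A quick NumPy check of the weighted sum above for a single hidden neuron; the bias value of 0.5 is an illustrative assumption (the notes do not fix it):

import numpy as np

X = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 0])                  # one encoded record
W = np.array([-0.26, -0.23, -0.25, -0.41, -1.16, 0.42, 0.36,
              -0.68, -0.19, -0.33])                            # initialized weights
b = 0.5                                                        # illustrative bias

z = W @ X + b          # weighted sum: -0.64 + 0.5 = -0.14
a = max(0.0, z)        # ReLU activation of this neuron
print(z, a)            # ≈ -0.14 and 0.0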

Batch Size > 1, e.g. 3

  • We pass three input records to the NN as a matrix instead of as a vector.
$$ X = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{bmatrix} $$
  • Assuming a hidden layer with four neurons, we have a weight matrix, e.g.:
$$ W = \begin{bmatrix} 0.2 & 0.5 & 0.1 & 0.7 \\ 0.3 & 0.8 & 0.2 & 0.6 \\ 0.5 & 0.3 & 0.4 & 0.9 \\ 0.6 & 0.2 & 0.5 & 0.4 \\ 0.9 & 0.1 & 0.8 & 0.3 \\ 0.7 & 0.4 & 0.2 & 0.1 \\ 0.5 & 0.6 & 0.7 & 0.2 \\ 0.3 & 0.7 & 0.9 & 0.8 \\ 0.4 & 0.9 & 0.6 & 0.5 \\ 0.1 & 0.2 & 0.3 & 0.7 \\ \end{bmatrix} $$
  • Compute the weighted sums for the hidden layer, $Z = XW + b$ (a NumPy version follows this list)
$$ z_{j,k} = \sum_{i=1}^{n} X_{j,i} W_{i,k} + b_k $$ $$ Z = \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3} & z_{1,4} \\ z_{2,1} & z_{2,2} & z_{2,3} & z_{2,4} \\ z_{3,1} & z_{3,2} & z_{3,3} & z_{3,4} \\ \end{bmatrix} $$
  • This is why we need GPUs/TPUs. Matrix multiplications can be efficiently parallelized across thousands of cores (e.g. the RTX 4090 has 16,384 CUDA cores)
  • Batch sizes vary depending on hardware, dataset size, and model type:
    • CPU: 16-32; GPU: 64-128; server-grade GPU: 256-1024+; TPU: 512-4096
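The same computation in batched form with NumPy, assuming a zero bias vector for the four hidden neurons (the notes do not give bias values):

import numpy as np

# Batch of 3 encoded records (3 x 10), as given above
X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 1],
])

# Weight matrix (10 inputs x 4 hidden neurons), as given above
W = np.array([
    [0.2, 0.5, 0.1, 0.7],
    [0.3, 0.8, 0.2, 0.6],
    [0.5, 0.3, 0.4, 0.9],
    [0.6, 0.2, 0.5, 0.4],
    [0.9, 0.1, 0.8, 0.3],
    [0.7, 0.4, 0.2, 0.1],
    [0.5, 0.6, 0.7, 0.2],
    [0.3, 0.7, 0.9, 0.8],
    [0.4, 0.9, 0.6, 0.5],
    [0.1, 0.2, 0.3, 0.7],
])

b = np.zeros(4)        # one bias per hidden neuron (illustrative)

Z = X @ W + b          # (3 x 10) @ (10 x 4) -> (3 x 4), one row per record
A = np.maximum(0, Z)   # ReLU applied element-wise
print(Z.shape)         # (3, 4)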

#Implementation

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
# Generate a synthetic dataset (binary classification)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
# Split dataset into training (70%), validation (15%), and testing (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Fit on training data and transform
X_val = scaler.transform(X_val) # Transform validation data using same scaler
X_test = scaler.transform(X_test) # Transform test data using same scaler
  • Standardization makes feature scales uniform.
  • Prevents features with larger ranges from dominating those with smaller ranges.
  • Without it, the NN can over-weight features with larger ranges.
  • Training can be slow or fail when feature scales differ greatly.
  • Very different scales can also cause exploding/vanishing activations in deeper layers.
  • Note: fit the scaler on the training set only; apply the same transform to the validation and test sets (the sketch below continues with model definition, training, and evaluation).
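A minimal sketch of how the pipeline might continue from here; the layer sizes, optimizer, epoch count, and batch size are illustrative choices, not taken from the notes:

# Define a small feedforward network (layer sizes are illustrative)
model = Sequential([
    Input(shape=(10,)),                  # 10 input features
    Dense(16, activation='relu'),        # hidden layer
    Dense(8, activation='relu'),         # hidden layer
    Dense(1, activation='sigmoid')       # binary classification output
])

# Compile: optimizer, loss function, and metric are hyperparameters
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train with validation data to monitor generalization
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32, verbose=0)

# Evaluate on the held-out test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}, test loss: {test_loss:.4f}")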

#Hyperparameters

  • Model hyperparameters:

    • Number of layers
    • Number of neurons
    • Activation function
  • Compile hyperparameters:

    • Optimizer
    • Loss function
    • Evaluation metric
  • Regularization hyperparameters:

    • Early stopping: dynamically adjust the number of epochs
    • Metric to monitor
  • Tuning examples:

    • Number of neurons/layer
    • Different activations
  • Training hyperparameters:

    • num. epochs
    • batch size
    • validation data
  • Measuring whether the model is overfitting:

  • Training vs. Validation Accuracy:

    • If the validation accuracy follows the training accuracy, the model is learning correctly.
    • If the validation accuracy diverges, the model may be overfitting.
  • Training vs. Validation Loss:

    • If validation loss is decreasing, the model generalizes well.
    • If validation loss increases while training loss decreases, the model is overfitting (see the plotting sketch below).
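A minimal plotting sketch for these curves, assuming the history object returned by model.fit in the sketch above (the metric key names are Keras defaults):

# Plot training vs. validation accuracy and loss from the Keras History object
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(history.history['accuracy'], label='train accuracy')
ax1.plot(history.history['val_accuracy'], label='validation accuracy')
ax1.set_xlabel('epoch'); ax1.legend()

ax2.plot(history.history['loss'], label='train loss')
ax2.plot(history.history['val_loss'], label='validation loss')
ax2.set_xlabel('epoch'); ax2.legend()

plt.show()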

  • Final training results:

    • Accuracy on Test Set: 0.8667
    • Loss on Test Set: 0.3260

  • PyTorch and Keras offer multiple optimization methods.

  • E.g., EarlyStopping implements a callback for the model.fit method (see the sketch at the end of this section).

  • The callback fires at every epoch, reporting the value of a user-defined (hyperparameter) metric.

  • Here, we use 'loss' and stop when it does not decrease for 5 consecutive epochs.

  • Early stopping saves training time once the monitored metric stops improving.

  • Final training results with 15 epochs instead of 50:

    • Final Accuracy on Test Set: 0.8533 vs. 0.8667
    • Final Loss on Test Set: 0.3459 vs. 0.3260
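A sketch of the EarlyStopping setup described above, monitoring 'loss' with a patience of 5 epochs; hooking it into the earlier model.fit call might look like this (restore_best_weights is an extra, optional flag):

# Stop training when 'loss' has not decreased for 5 consecutive epochs
early_stop = EarlyStopping(monitor='loss', patience=5, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32,
                    callbacks=[early_stop], verbose=0)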