- For sequential data, connections form a loop so the network keeps a memory of past inputs.
- Use: speech recognition, time-series forecasting, machine translation, chatbots.
- Data flow: loops allow information to persist across different time steps, making it suitable for sequence learning.
- Structure: Input layer → Recurrent hidden layers (with memory state) → Output.
- Activation function: Tanh/ReLU in hidden layers, Softmax for sequence classification.
- Loss function: Cross-entropy loss for classification, MSE for regression tasks.
- Learning: Uses gradient descent and backpropagation through time (BPTT) to update weights.
- Pros: Can handle sequential data, remembers previous inputs, good for time-dependent tasks.
- Cons: Suffers from vanishing gradients, struggles with long-term dependencies, slow training.
#Example Architecture
- RNNs have loops to feed back information from previous steps into the network.
- RNNs remember prior inputs, making them ideal for tasks where context is important.
#Hyperparameters
- Embedding dimensionality of words
- Size of memory (dimensionality of memory vector)
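- A minimal PyTorch sketch of where these two hyperparameters appear; the vocabulary size and dimensions below are illustrative, not from the notes:

```python
import torch.nn as nn

vocab_size = 10_000  # number of words in the vocabulary (assumed)
embed_dim = 4        # embedding dimensionality of words
hidden_dim = 8       # size of memory (dimensionality of the hidden state)

embedding = nn.Embedding(vocab_size, embed_dim)  # word index -> dense vector
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)  # recurrent layer with memory
```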
#RNNs Unrolled
- Once unrolled, RNNs become chains of repeating units, often called “cells”.
- RNNs take one input at a time and update an internal memory called hidden state.
- Hidden state update: $h_t = f(W_x X_t + W_h h_{t-1} + b)$, where:
- $h_t$ = Hidden state at time step $t$ (memory)
- $X_t$ = Input at time step $t$
- $W_x$ = Weights for the input $X_t$
- $W_h$ = Weights for the hidden state (previous memory)
- $b$ = bias
- $f$ = Activation function (usually $\tanh$ or ReLU)
- Each cell is a “mini-network” that processes sequential data step by step with a set of neurons that apply transformations over time.
- $\tanh$ (hyperbolic tangent) is the activation function applied after the weighted sum of the current input and the previous hidden state.
- Softmax is one of the activation functions used for the output layer.
- Note: an RNN can produce an output (i.e. prediction) at each iteration and/or pass the hidden state to the next cycle without an output.
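- A minimal NumPy sketch of one unrolled RNN, assuming 4-dimensional inputs and an 8-dimensional hidden state (both illustrative):

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b, h0):
    """Unroll a simple RNN cell: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    h = h0
    hidden_states = []
    for x_t in X:                      # one input element per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hidden_states.append(h)
    return hidden_states               # h_1, ..., h_T

# Toy dimensions (assumed): 4-dimensional inputs, 8-dimensional hidden state.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 4, 8
X = rng.normal(size=(T, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
h0 = np.zeros(hidden_dim)

hs = rnn_forward(X, W_x, W_h, b, h0)
print(len(hs), hs[-1].shape)           # 5 hidden states, each of shape (8,)
```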
#RNN Patterns
- Vector to Sequence (one to many)
- e.g. image captioning, input an image and get the caption one word at a time.
- Sequence to Vector (many to one)
- e.g. spam classifier: reads an entire email one word at a time and returns a single spam/not-spam label.
- Encoder-Decoder (many to many)
- e.g. machine translation: reads in a sentence and outputs the translated sentence.
- Sequence to Sequence (many to many)
- e.g. price forecasting: reads in a time series of stock volume/technical indicators and returns a price prediction at each time step $t$.
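- A hedged PyTorch sketch of how many-to-one and many-to-many differ only in which hidden states we read out (shapes are illustrative):

```python
import torch
import torch.nn as nn

# Toy batch (assumed shapes): 2 sequences of 6 time steps, 4 features each.
x = torch.randn(2, 6, 4)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

outputs, h_n = rnn(x)  # outputs: hidden state at every step; h_n: final hidden state

# Sequence to vector (many to one), e.g. spam classification: keep only the final hidden state.
last = h_n[-1]         # shape (2, 8)

# Sequence to sequence (many to many), e.g. per-step forecasting: keep the output at every step.
per_step = outputs     # shape (2, 6, 8)
print(last.shape, per_step.shape)
```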
#RNN Text Example
- Dataset:
- “I love machine learning”
- “Deep learning is very difficult”
- “Learning models”
- FNNs took the whole sentence as a single vector; RNNs process each word sequentially.
- Instead of Bag of Words (which ignores order), we represent each sentence as a sequence of word indices, i.e. each sentence is a sequence of numbers.
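- A small sketch of this step; the word-to-index assignment below is arbitrary (first-seen order), not from the notes:

```python
sentences = [
    "I love machine learning",
    "Deep learning is very difficult",
    "Learning models",
]

# Build a word -> index vocabulary (lower-cased, first-seen order).
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab))

# Each sentence becomes a sequence of numbers instead of a Bag of Words.
index_sequences = [[vocab[w] for w in s.lower().split()] for s in sentences]
print(index_sequences)  # [[0, 1, 2, 3], [4, 3, 5, 6, 7], [3, 8]]
```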
- Instead of using BoW, we use word embeddings:
- Convert each word into a dense vector of real numbers.
- Capture relationships between words, e.g. “love” and “like” are similar.
- Each word index is mapped to a pre-trained or trainable embedding vector. This is done using an embedding layer in neural networks.
- The sentence “I love machine learning” is converted to an embedding matrix, i.e. a sequence of dense vectors:
- $[0.5, 0.2, 0.1, 0.8], [0.9, 0.1, 0.7, 0.4], [0.3, 0.8, 0.5, 0.6], [0.6, 0.7, 0.3, 0.9]$
- $[0.5, 0.2, 0.1, 0.8], [0.9, 0.1, 0.7, 0.4], [0.3, 0.8, 0.5, 0.6], [0.6, 0.7, 0.3, 0.9]$
- We pass each element of the sequence to the RNN, one element at each time step $t$.
- The RNN updates the hidden state $h$ at each time step $t$, accumulating information about the entire sentence in the final hidden state.
- Execution process for “I love machine learning”:
- $t_1 \rightarrow [0.5, 0.2, 0.1, 0.8] \rightarrow h_1 = f(W_x X_1 + W_h h_0 + b)$
- $t_2 \rightarrow [0.9, 0.1, 0.7, 0.4] \rightarrow h_2 = f(W_x X_2 + W_h h_1 + b)$
- $t_3 \rightarrow [0.3, 0.8, 0.5, 0.6] \rightarrow h_3 = f(W_x X_3 + W_h h_2 + b)$
- $t_4 \rightarrow [0.6, 0.7, 0.3, 0.9] \rightarrow h_4 = f(W_x X_4 + W_h h_3 + b)$. Note: the size of the hidden state $h$ is a hyperparameter chosen based on the size of the input and the amount of “memory” we want to give to the RNN.
- Assume $h$ is a vector of size 8; at each time step $t > 0$ we compute two matrix products:
- 1st matrix ($W_x$): transforms the current word (e.g. the 2nd word at $t_2$) from embedding space to hidden space.
- 2nd matrix ($W_h$): transforms the previous hidden state into the current hidden state.
- $\tanh$ is applied to the resulting 8-dimensional vector, which carries the memory accumulated from $h_0$ and $h_1$.
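- Continuing the sketch above with the actual embedding vectors: $W_x$ is $8 \times 4$ (embedding space to hidden space) and $W_h$ is $8 \times 8$ (hidden space to hidden space); the weight values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)
hidden_dim = 8                                   # size of the hidden state (hyperparameter)

X = np.array([[0.5, 0.2, 0.1, 0.8],              # "I"
              [0.9, 0.1, 0.7, 0.4],              # "love"
              [0.3, 0.8, 0.5, 0.6],              # "machine"
              [0.6, 0.7, 0.3, 0.9]])             # "learning"

W_x = rng.normal(size=(hidden_dim, 4))           # 8x4: embedding space -> hidden space
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # 8x8: previous hidden state -> current
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                         # h_0
for x_t in X:
    h = np.tanh(W_x @ x_t + W_h @ h + b)         # h_t = tanh(W_x X_t + W_h h_{t-1} + b)
print(h)                                         # h_4: an 8-dimensional memory of the sentence
```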
- We take the final hidden state $h_4$ and pass it through a fully connected layer to predict sentiment:
- $y = \text{Softmax}(W_o h_4 + b_o)$
- $h_4$: final hidden state (memory of the sentence)
- $W_o$, $b_o$: weight matrix of the output layer and bias vector of the output layer
- Softmax converts outputs into probabilities for positive, neutral, or negative
- We need a fully connected output layer because the hidden state $h_4$ is not in the correct format for classification.
- $W_o$ and $b_o$ map $h_4$ to a probability distribution over the possible sentiment classes.
- The output of the softmax layer is, for example:
- Positive: 89.1%
- Neutral: 10.7%
- Negative: 0.2%
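- A sketch of this output layer with random placeholder weights (so the exact 89.1/10.7/0.2 split above will not be reproduced):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
h4 = rng.normal(size=8)        # final hidden state (memory of the sentence)
W_o = rng.normal(size=(3, 8))  # output weights: hidden space -> 3 classes
b_o = np.zeros(3)              # output bias

y = softmax(W_o @ h4 + b_o)    # probabilities for positive / neutral / negative
print(y, y.sum())              # three probabilities that sum to 1
```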
- This concludes the forward pass of the RNN.
- During training, we apply backpropagation through time (BPTT):
- Compute the loss by comparing the predicted class probabilities to the true class.
- Compute the gradients backward through time.
- Update the weights $W_o$, $W_x$, $W_h$ by applying gradient descent.
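- A hedged PyTorch sketch of one training step; autograd performs BPTT automatically when `.backward()` is called on the loss. The model, word indices, and label below are illustrative:

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    """Illustrative model: embedding -> RNN -> linear output over 3 sentiment classes."""
    def __init__(self, vocab_size=10, embed_dim=4, hidden_dim=8, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)  # W_o, b_o

    def forward(self, x):
        _, h_n = self.rnn(self.embedding(x))
        return self.out(h_n[-1])                       # logits from the final hidden state

model = SentimentRNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()                      # softmax + cross-entropy loss

x = torch.tensor([[0, 1, 2, 3]])  # "I love machine learning" as word indices (assumed)
y = torch.tensor([0])             # true class, e.g. 0 = positive (assumed)

loss = criterion(model(x), y)     # 1. compute the loss against the true class
loss.backward()                   # 2. gradients backward through time (BPTT)
optimizer.step()                  # 3. gradient descent update of W_o, W_x, W_h
optimizer.zero_grad()
```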
#Problems with RNN
- Vanishing and exploding gradients make long-term dependencies hard to learn.
- Slow training due to the sequential architecture, e.g. 100 words = 100 steps.
- Short-term memory due to the finite size of the hidden state.
- Bias towards recent inputs, because gradients decay as they backpropagate.
- RNNs update weights with BPTT, computing the gradient of the loss with respect to the parameters using the chain rule.
- This repeatedly multiplies gradients by the recurrent weight matrix $W_h$.
- Eigenvalues < 1: gradients shrink to near zero (vanish).
- Eigenvalues > 1: gradients grow uncontrollably (explode).
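- A simplified NumPy illustration of the mechanism; BPTT actually multiplies Jacobians, approximated here by one fixed matrix whose largest eigenvalue magnitude we control:

```python
import numpy as np

def norm_after_repeated_multiplication(steps, spectral_radius, size=8, seed=0):
    """Repeatedly multiply a vector by a recurrent-style matrix and return its norm."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(size, size))
    W = spectral_radius * W / np.max(np.abs(np.linalg.eigvals(W)))  # set largest |eigenvalue|
    g = np.ones(size)
    for _ in range(steps):
        g = W @ g
    return np.linalg.norm(g)

print(norm_after_repeated_multiplication(50, 0.9))  # shrinks towards 0 -> vanishing gradients
print(norm_after_repeated_multiplication(50, 1.1))  # blows up          -> exploding gradients
```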
- Vanishing gradients:
- Earlier time steps don’t impact learning, i.e., long-term dependencies are forgotten.
- E.g., “not” in “I do not like this movie” could be forgotten.
- Exploding gradients:
- Gradient updates become too large, producing NaN errors that stop the model from converging.
- Loss fluctuates widely, and the model might memorize noise instead of patterns.