Parallelized neural network that uses a self-attention mechanism instead of recurrence to process sequential data.
- Use
    - Natural Language Processing (NLP), speech recognition, machine translation, chatbots, large language models (e.g., GPT, BERT)
- Data Flow
    - Uses multi-head self-attention to capture relationships among tokens in parallel.
- Structure
    - Input Embedding → Multi-head self-attention layers → Feedforward layers → Output Layer (sketched in code after this list).
- Activation Function
    - ReLU in hidden layers.
    - Softmax for classification tasks.
- Loss Function
    - Cross-Entropy Loss for classification.
    - MSE for regression.
- Learning
    - Uses attention mechanisms; trained with the Adam optimizer on large datasets.
- Pros
    - Processes tokens in parallel (no recurrence), enabling fast training
- Cons
    - Requires massive training datasets
    - Computationally expensive
    - Hard to interpret
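
Tying the rows above together, here is a minimal NumPy sketch of that structure (embedding → self-attention → feedforward → softmax output, scored with cross-entropy); every shape, weight, and label is a random placeholder, not a value from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, vocab = 6, 8, 20

X = rng.normal(size=(seq_len, d_model))            # embedded input tokens

# Single-head self-attention (the mechanism is detailed in #Self-Attention)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
attn_out = weights @ V

# Position-wise feedforward with ReLU (the hidden-layer activation above)
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
ff_out = np.maximum(0.0, attn_out @ W1) @ W2

# Output layer + softmax, scored with cross-entropy loss
logits = ff_out @ rng.normal(size=(d_model, vocab))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
targets = rng.integers(vocab, size=seq_len)        # dummy labels
loss = -np.log(probs[np.arange(seq_len), targets]).mean()
print(f"cross-entropy loss: {loss:.3f}")
```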
#Architecture
- Processes features in parallel. E.g., in a sentence, words/n-grams are processed in parallel, not sequentially.
- Processing all the features at the same time enables learning relational patterns among them.
- Examples:
    - Learn contextual use of words: “The needle has a sharp point.” vs. “It is not polite to point at people.”
    - Learn references: “I went to France after midterm, and there I ate the delicious food of that country.”
- Architecture:
    - Encoder (left); decoder (right).
    - Self-attention computes relationships between words.
    - Stacked layers (Nx times) build deep representations (toy sketch after this list).
    - Self-attention replaces recurrence.
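
The "stacked Nx times" point can be made concrete with a toy sketch; the sublayer body below is a stand-in (the real encoder layer, with attention and Add & Norm, is described in #Encoder), and only the repeat-with-residual pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
N, seq_len, d_model = 6, 5, 8             # the original paper stacks N = 6 layers

x = rng.normal(size=(seq_len, d_model))   # embedded input sentence
for _ in range(N):                        # Nx: same structure, separate weights
    W = 0.1 * rng.normal(size=(d_model, d_model))
    x = x + np.tanh(x @ W)                # placeholder sublayer + residual connection
```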
#Encoder
- Left side: encoder processes the input sequence.
- Right side: decoder generates the output sequence.
- Encoder/decoder use multi-head self-attention + feedforward layers.
- Nx indicates that the block is stacked N times (N identical layers).
- Input embedding: word embeddings carry semantic meaning but lack positional order.
- Positional encoding: adds order information to embeddings (see the sketch after this list). Note: RNNs encoded order through recurrence instead.
- Multi-Head Self-Attention: each word attends to all other words in the sequence; intuitively, one head per relation type.
- Add & Norm: residual connection (Add) helps gradient flow; layer normalization (Norm) stabilizes training.
- FFN: applies the same feedforward transformation to each position in parallel.
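
The notes only say positional encoding "adds order information"; below is a sketch of one standard choice, the sinusoidal encoding from the original Transformer paper, plus the Add & Norm step (function names and shapes are mine):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def add_and_norm(x: np.ndarray, sublayer_out: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Residual connection (Add) followed by layer normalization (Norm)."""
    y = x + sublayer_out                           # Add: helps gradient flow
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)                # Norm: stabilizes training

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))                        # embedded tokens (seq_len=6, d_model=8)
X = X + positional_encoding(6, 8)                  # inject order information
X = add_and_norm(X, np.tanh(X))                    # np.tanh stands in for a sublayer
```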
#Self-Attention
- Self-Attention: key mechanism that allows Transformers to process input sequences without recurrence.
- Instead of processing one word at a time (like RNNs), self-attention processes all words simultaneously.
- Each word is transformed into three vectors:
    - Query (Q): “What am I looking for?”
    - Key (K): “What do I have?”
    - Value (V): “What information do I pass forward?”
- Each word compares itself to every other word in the sequence.
- Words that are relevant to each other get higher attention scores.
- The model weighs the words based on these scores before making a decision.
- Input sequence:
    - E.g., “The cat sat on the mat”.
    - Tokenized and embedded into a matrix $X$ (as seen for RNNs).
- Each word in $X$ is multiplied by three weight matrices $(W_Q, W_K, W_V)$, creating three new matrices: $Q = X W_Q$, $K = X W_K$, $V = X W_V$.
- Compute attention scores:
    - Use the scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
    - $Q K^{T}$: computes the similarity between each query and each key.
    - $\sqrt{d_{k}}$: scaling factor to prevent large values.
    - Softmax: converts scores into probabilities.
    - Multiplying by $V$: weighs the words based on their importance.
- Compute final attention output (see the NumPy sketch after this list):
    - Words that are more relevant get a higher weight.
    - The attention mechanism highlights important words while downplaying irrelevant ones. E.g., given “cat”, “sat” will be more important than “mat”.
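
Putting the steps above together, a NumPy sketch of scaled dot-product self-attention exactly as in the formula (the sentence, weights, and dimensions are placeholders):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # project every word three ways
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # query-key similarities, scaled
    weights = softmax(scores)             # each row: attention over all words
    return weights @ V, weights           # value vectors, weighted by relevance

rng = np.random.default_rng(2)
seq_len, d_model, d_k = 6, 8, 8           # e.g. "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))   # embedded input sequence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn[1].round(2))                   # how much "cat" attends to each word
```

Multi-head attention repeats this with several independent $(W_Q, W_K, W_V)$ triples and concatenates the per-head outputs.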
#Learning Flow
- Encoder encodes the input sentence, i.e., the sentence to translate.
- Decoder processes the target sentence (i.e., the actual translation), conditioned on the encoder output, to train the network.
- In the training phase, the encoder processes the input sentence and the decoder processes the target sentence simultaneously.
- Parallelism is provided by multi-head attention (replacing the recurrence in RNNs).
- Masking: the decoder input starts with a control (start-of-sequence) token, and future positions are masked out (mask sketch at the end of this section). Without masking the decoder would see the future, i.e., the words it has to predict.
- The encoder gets the sentence as input; the decoder predicts one word at a time.
- At inference time, the decoder runs sequentially: every step is based on previously predicted words and the encoder context.
- Within each decoding step, computation (self-attention, encoder-decoder attention, FFN) remains parallel.
- However, the sequential dependency across decoding steps prevents full parallelization over the entire output sequence.
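
A sketch of the causal (look-ahead) mask described above: adding $-\infty$ to the scores of future positions before the softmax drives their attention weights to exactly zero, so training can feed the whole target sentence at once (helper names are mine):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """0 where attention is allowed (past and present), -inf on future positions."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores + causal_mask(scores.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Raw decoder scores (Q K^T / sqrt(d_k)) for a 4-token target sentence
scores = np.random.default_rng(3).normal(size=(4, 4))
print(masked_softmax(scores).round(2))    # strictly upper triangle is all zeros
```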