Issues with RNN models
- Linear interaction distance: words that should interact with each other can be a linear (O(sequence length)) number of steps apart, which makes long-distance dependencies hard to learn.
- Lack of parallelizability: the forward and backward passes contain a chain of unparallelizable operations, since each hidden state depends on the previous one, whereas GPUs excel at performing many independent computations at once.
Word window
We can stack more word-window layers to widen the context, but the receptive field is always finite.
Attention
Let’s look at attention within a single sentence.
Computation across attention layers is still sequential (each layer depends on the previous one), but within a layer all positions can be attended to in parallel.
Self-Attention
A simple explanation: for every word embedding, we compute a score against every other word in the sentence, softmax the scores, and output a weighted average of the word vectors.
As the equation shows, in this basic form the same word vectors play the roles of queries, keys, and values.
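A minimal sketch of this idea in PyTorch (the names `basic_self_attention` and `embeddings` are illustrative; here each word embedding is used directly as its own query, key, and value):

```python
import torch

def basic_self_attention(embeddings):
    """embeddings: (n, d) word vectors; each acts as query, key, and value."""
    # Score every pair of words by dot product: (n, n)
    scores = embeddings @ embeddings.T
    # Normalize each row into a probability distribution over the sentence
    weights = torch.softmax(scores, dim=-1)
    # Each output is a weighted average of all word vectors
    return weights @ embeddings

# Example: 5 words with 8-dimensional embeddings
out = basic_self_attention(torch.randn(5, 8))
print(out.shape)  # torch.Size([5, 8])
```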
Problem 1: Sequence Order
Self-attention doesn’t build in order information, so we need to encode the order of the sentence in our keys, queries, and values.
We can represent each sequence index as a vector and add it to the input embedding at that position.
There are several ways to build these position vectors (the sinusoidal option is sketched after this list):
- Position representation vectors through sinusoids
- Position representation vectors learned from scratch
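A sketch of the sinusoidal option (a hypothetical helper, assuming an even model dimension and the usual 10000-based frequencies); the resulting vectors are added to the word embeddings before the first layer:

```python
import torch

def sinusoidal_positions(n_positions, d_model):
    """Fixed (not learned) position vectors built from sines and cosines of varying frequency.
    Assumes d_model is even."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (n, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)             # (d/2,)
    freq = 1.0 / (10000 ** (dim / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dimensions
    return pe

# Usage: x = word_embeddings + sinusoidal_positions(n, d_model)
```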
Problem 2: No nonlinearities for DL
It’s all just weighted averages.
There is an easy fix: add a feed-forward network to post-process each output vector; the FF network processes the result of attention and gives the stack its nonlinearity.
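A minimal sketch of this fix (the sizes 512 and 2048 are illustrative); the same two-layer network is applied to each position independently:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048   # illustrative sizes

# Applied independently to each output vector of the attention layer
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                # the nonlinearity attention alone lacks
    nn.Linear(d_ff, d_model),
)

# attn_out: (n, d_model) -> same shape, now post-processed with a nonlinearity
# out = feed_forward(attn_out)
```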
Problem 3: Need to ensure we don’t “look at the future” when predicting a sequence
Fix: masking the future in self-attention (we do this for all decoder layers, not just the first layer)
To use self-attention in decoders, we need to ensure we can’t peek at the future (otherwise the prediction is useless).
To enable parallelization, we mask out attention to future words by setting their attention scores to -∞.
When we calculate the attention scores, only the current word and the words before it keep their normal scores; for all words after the current word, the score is set to -∞, so the softmax gives them zero weight.
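A small sketch of this masking (hypothetical helper; scores for future positions are set to -inf before the softmax, so their weights come out as zero):

```python
import torch

def masked_attention_weights(scores):
    """scores: (n, n) attention scores; row i may only look at positions <= i."""
    n = scores.size(0)
    # Lower-triangular matrix of ones: position i can see positions 0..i
    allowed = torch.tril(torch.ones(n, n)).bool()
    # Future positions get -inf, so softmax assigns them zero weight
    masked = scores.masked_fill(~allowed, float('-inf'))
    return torch.softmax(masked, dim=-1)

print(masked_attention_weights(torch.randn(4, 4)))
# Row 0 attends only to word 0, row 1 to words 0-1, and so on.
```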
The Transformer Encoder-Decoder
Key-Query-Value Attention
Multi-headed Attention
For each word, we use multiple attention heads, each with its own key, query, and value projections, so the model can attend to different kinds of information in different places; the heads’ outputs are then concatenated.
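A compact sketch of multi-headed self-attention under these assumptions (head count and sizes are illustrative): project into queries, keys, and values, split them into heads, attend within each head, then concatenate and mix.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (n, d_model)
        n = x.size(0)
        # Project, then split the model dimension into heads: (n_heads, n, d_head)
        q = self.q_proj(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
        k = self.k_proj(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
        v = self.v_proj(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
        # Scaled dot-product attention within each head
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                    # (n_heads, n, d_head)
        # Concatenate heads back to (n, d_model) and mix them
        return self.out_proj(heads.transpose(0, 1).reshape(n, -1))

print(MultiHeadSelfAttention()(torch.randn(5, 512)).shape)  # torch.Size([5, 512])
```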
Residual Connections
Layer Normalization
Scaled Dot Product
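A single-head sketch tying several of these pieces together (all names and sizes are illustrative): separate key, query, and value projections of the same input, dot products scaled by the square root of the dimension, and the output wrapped in a residual connection plus layer normalization.

```python
import math
import torch
import torch.nn as nn

d_model = 512                       # illustrative size
W_q = nn.Linear(d_model, d_model)   # query, key, value projections of the same input
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)
norm = nn.LayerNorm(d_model)

def attention_block(x):             # x: (n, d_model)
    q, k, v = W_q(x), W_k(x), W_v(x)
    # Scaled dot product: divide by sqrt(d) so scores don't grow with dimension
    scores = q @ k.T / math.sqrt(d_model)
    out = torch.softmax(scores, dim=-1) @ v
    # Residual connection plus layer normalization
    return norm(x + out)

print(attention_block(torch.randn(5, d_model)).shape)  # torch.Size([5, 512])
```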
Cross Attention (details)
Cross-attention can be thought of as the same operation as self-attention, but with a different source of inputs: the queries come from the decoder, while the keys and values come from the encoder output.
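A sketch of that difference in inputs (single-head, names illustrative): queries come from the decoder states, while keys and values come from the encoder output.

```python
import math
import torch
import torch.nn as nn

d_model = 512
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

def cross_attention(decoder_states, encoder_output):
    # Queries come from the decoder; keys and values come from the encoder
    q = W_q(decoder_states)                            # (n_dec, d_model)
    k, v = W_k(encoder_output), W_v(encoder_output)    # (n_enc, d_model)
    scores = q @ k.T / math.sqrt(d_model)              # (n_dec, n_enc)
    return torch.softmax(scores, dim=-1) @ v           # (n_dec, d_model)

out = cross_attention(torch.randn(3, d_model), torch.randn(7, d_model))
print(out.shape)  # torch.Size([3, 512])
```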
Problems
- Quadratic compute in self-attention: computing all pairs of interactions means our computation grows quadratically with the sequence length.
- Position representation: are simple absolute indices the best way to represent position?