Attention is All You Need (Vaswani et al.)

June 20, 2024

Motivation: RNNs and LSTMs process tokens one at a time, so computation within a sequence is hard to parallelize and training on long sequences is expensive

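1.ByteNet, ConvS2S, Extended Neural GPU -> use convolutions to reduce sequential computation
1.the number of operations needed to relate two tokens grows with the distance between them, so dependencies between distant tokens become expensive to learn
2.Transformer -> the number of operations needed to relate any two tokens is constant, O(1)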
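-trade-off: the transformer reduces the effective "resolution" of each single position, because attention averages over attention-weighted positions (multi-head attention counteracts this)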

Attention: mapping a query and a set of key-value pairs to an output

Scaled Dot Product Attention

-compute weighted sum of values (V) based on the similarity (dot product) between query vectors (Q) and key vectors (K)
-helps to learn which parts of an input sequence are important for producing an output
-Query - the specific token/word that we are trying to calculate attention for
-derived from input embeddings
-Key - every token/word in the input that the query is compared against to measure relevance
-derived from input embeddings
-help determine how relevant other words are to the current word/token represented by Q
-Value - the content each token contributes to the output (a numeric representation of its meaning, syntactic role, and context)
-derived from input embeddings
-Output = weighted sum of values, where each weight is the softmax of the query-key dot product scaled by 1/sqrt(d_k)
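
A minimal NumPy sketch of the computation described above, softmax(Q K^T / sqrt(d_k)) V. The function names, shapes, and random inputs are illustrative assumptions, not code from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of values

# Random stand-ins for real projections: 3 queries, 5 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, so each row of `output` is a convex combination of the value vectors.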

e.g. in the sentence "I love cats and dogs", which is split into the tokens [I, love, cats, and, dogs] and embedded accordingly:

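-Q holds one query vector per token: the attention for "cats" uses the "cats" query, the attention for "dogs" uses the "dogs" query, etc. (in practice all queries are computed at once as a matrix, not in a loop)
-K holds one key vector for every token in the sentence (including the query token itself); each key is compared to the current query, e.g. the query for "cats"
-V holds one value vector per token, i.e. the learned content that each token contributes to the weighted sum
-Q, K, and V are computed from the same embeddings using three separate learned projection matrices (W^Q, W^K, W^V)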
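
The example sentence can be traced in a few lines. This is only a sketch: the embeddings and projection matrices are random stand-ins for learned parameters, and the sizes `d_model` and `d_k` as well as the names `W_q`, `W_k`, `W_v` are chosen for illustration:

```python
import numpy as np

tokens = ["I", "love", "cats", "and", "dogs"]
d_model, d_k = 8, 4                          # illustrative sizes, not from the paper

rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d_model))  # stand-in for learned embeddings

# Three separate projection matrices (random here, learned in a real model).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # one row per token in each matrix

i = tokens.index("cats")
scores = Q[i] @ K.T / np.sqrt(d_k)           # compare the "cats" query with every key
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax over the five tokens
cats_output = weights @ V                    # new representation of "cats"

for token, w in zip(tokens, weights):
    print(f"{token:>5}: {w:.3f}")
```

The printed weights show how much each token, including "cats" itself, contributes to the output for "cats". With random parameters the numbers carry no meaning; only the mechanics are the point.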

Root (Original) Problem: word-by-word translation does not work effectively when grammatical structures are different

Solution: attention mechanism that has access to all elements in the sequence at each time step (not sequential)

context - surrounding elements in a sequence that provide additional meaning/relevance to a specific element

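multi-head attention - run several attention heads in parallel so the model can jointly attend to information from different representation subspaces at different positions

-each head has its own projection matrices for Q, K, and V, which are randomly initialized and then learned; the head outputs are concatenated and projected once more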
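
A rough sketch of multi-head self-attention, assuming two heads and random weights in place of learned ones. The helper names and the output projection `W_o` are hypothetical labels for this example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """X: (seq_len, d_model); heads: list of (W_q, W_k, W_v) tuples."""
    outputs = []
    for W_q, W_k, W_v in heads:
        # Each head projects the same input with its own matrices.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outputs.append(weights @ V)                 # (seq_len, d_k) per head
    # Concatenate the heads and project back to d_model.
    return np.concatenate(outputs, axis=-1) @ W_o

seq_len, d_model, n_heads = 5, 8, 2
d_k = d_model // n_heads
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    heads.append((rng.normal(size=(d_model, d_k)),
                  rng.normal(size=(d_model, d_k)),
                  rng.normal(size=(d_model, d_k))))
W_o = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, W_o)
```

Splitting `d_model` across heads (`d_k = d_model // n_heads`) keeps the total cost close to that of a single full-width head.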

Encoder: break down and understand a sentence

1.Input Embedding - dense vector representation using an embedding layer
2.Positional Encoding - attention has no built-in sense of order, so encodings are added to the embeddings to provide info about word position
3.Multi-Head Self-Attention - used to capture importance of words and relationships within a sequence
4.Feed-Forward NN - applied to each position independently
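
For the positional encoding step, the paper uses fixed sine and cosine functions of different frequencies. A small sketch of that scheme, assuming an even `d_model`; the function name and the zero-valued placeholder embeddings are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 2i for i = 0, 1, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

embeddings = np.zeros((5, 8))                      # placeholder for 5 token embeddings
pe = sinusoidal_positional_encoding(5, 8)
encoder_input = embeddings + pe                    # encodings are added, not concatenated
```

Every position gets a distinct vector, which is what lets the otherwise order-agnostic attention layers tell positions apart.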

Decoder: take understanding from encoder and produce an output

1.Input Embedding - during training the target sequence (shifted right) is embedded; at inference the tokens generated so far are embedded
2.Positional Encoding - encodings added to provide info about word position
3.Masked Multi-Head Self-Attention - maintains the causal structure during generation, ensuring that each word is predicted from only the words that were already generated
1."masking" - the prediction for a word depends only on previous words, not future ones (autoregressive)
2.simulates the conditions of inference during training (prevents the model from cheating by looking at future target words)
4.Multi-Head Attention Over Encoder Output
1.allows the decoder to use information from the entire input sequence for generation
2.important for tasks where the output is conditioned on an input sequence, e.g. translation
5.Feed-Forward NN
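
The masking in step 3 can be sketched with a lower-triangular mask applied to the attention scores before the softmax. This is an illustrative single-head version with random inputs; the function names are assumptions for the example:

```python
import numpy as np

def causal_mask(seq_len):
    # True where a query position is allowed to look at a key position.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Future positions get -inf, so they receive zero weight after softmax.
    scores = np.where(causal_mask(seq_len), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
out, weights = masked_self_attention(Q, K, V)
```

The upper triangle of `weights` is zero, so position t only mixes values from positions up to t. The first position can attend only to itself, which makes `out[0]` equal to `V[0]`.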