Transformer Introduction
The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations.
- Code walkthrough: http://nlp.seas.harvard.edu/2018/04/03/attention.html
1. Embedding
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
The word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
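A minimal sketch of the embedding lookup described above, assuming a toy two-word vocabulary and an embedding size of 512 as in the paper; the embedding table here is random rather than learned.

```python
import numpy as np

d_model = 512                                 # embedding size used in the paper
vocab = {"thinking": 0, "machines": 1}        # toy vocabulary (assumption)

# In a real model this matrix is learned; here it is random for illustration.
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

tokens = ["thinking", "machines"]
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)   # (2, 512) -- one 512-dimensional vector per input word
```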
2. Encoding
An encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.
Self-Attention
- The first step is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word): for each word, we create a Query vector, a Key vector, and a Value vector.
- The second step in calculating self-attention is to calculate a score: the dot product of the query vector with the key vector of the respective word we’re scoring (see the sketch after this list for a concrete implementation).
Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
- The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients. Other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
- The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
- The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
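A minimal numpy sketch of steps one through six for a single position, using the paper's dimensions (d_model = 512, d_k = 64); the weight matrices here are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_model, d_k = 512, 64
np.random.seed(0)

# Step 1: create Query/Key/Value vectors from each input embedding
# (W_Q, W_K, W_V are random stand-ins for trained weight matrices).
X = np.random.randn(2, d_model)          # embeddings for "Thinking", "Machines"
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Steps 2-4: score the first word against every word (dot product of its
# query with each key), divide by sqrt(d_k) = 8, and softmax the scores.
scores = Q[0] @ K.T / np.sqrt(d_k)
weights = softmax(scores)

# Steps 5-6: weight each value vector and sum them up.
z1 = weights @ V                         # output of self-attention at position 0
print(z1.shape)                          # (64,)
```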
3. Matrix Calculation of Self-Attention
- The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).
- We can condense steps two through six into one formula to calculate the outputs of the self-attention layer:
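Writing Q = XW^Q, K = XW^K, V = XW^V from the first step, the condensed formula is:

$$ Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$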
RNNs maintain a hidden state that allows them to incorporate their representation of the previous words/vectors they have processed with the current one being processed. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
4. The Beast With Many Heads
- “multi-headed” attention
- It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This would be useful when translating a sentence like “The animal didn’t cross the street because it was too tired”, where we’d want to know which word “it” refers to.
- It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
- Concatenate the matrices, then multiply them by an additional weight matrix WO (see the sketch at the end of this section).
- Multi-Headed self-attention visualization
As we encode the word “it”, one attention head is focusing most on “the animal”, while another is focusing on “tired” – in a sense, the model’s representation of the word “it” bakes in some of the representation of both “animal” and “tired”.
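A minimal sketch of multi-headed attention under the paper's dimensions (h = 8 heads, d_k = d_v = 64, d_model = 512); all weight matrices are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

h, d_model, d_k = 8, 512, 64
np.random.seed(0)
X = np.random.randn(2, d_model)                 # two input positions

# One Query/Key/Value projection per head (random stand-ins for trained weights).
W_Q = np.random.randn(h, d_model, d_k)
W_K = np.random.randn(h, d_model, d_k)
W_V = np.random.randn(h, d_model, d_k)
W_O = np.random.randn(h * d_k, d_model)

# Run the same attention calculation once per head, giving h different Z matrices.
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]

# Concatenate the heads and project back to d_model with W_O.
Z = np.concatenate(heads, axis=-1) @ W_O        # shape (2, 512)
print(Z.shape)
```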
5. Representing The Order of The Sequence Using Positional Encoding
The positional encoding helps the model determine the position of each word, or the distance between different words in the sequence.
In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values, each between -1 and 1. We’ve color-coded them so the pattern is visible.
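A minimal sketch of the sinusoidal positional encoding from the paper (sin on even indices, cos on odd); note the illustrated-transformer figure lays the sin and cos halves out side by side instead of interleaving them, but the idea is the same.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: one row per position, values in [-1, 1]."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices
    pe[:, 1::2] = np.cos(angles)                       # odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape, pe.min(), pe.max())    # (50, 512), values between -1 and 1
# The encoding for position p is simply added to the embedding of the p-th word.
```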
6. Residuals
Each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it, followed by a layer-normalization step.
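A minimal sketch of that residual wrapper, assuming the post-layer-norm arrangement of the original paper, LayerNorm(x + Sublayer(x)); `sublayer` stands in for either self-attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)      # gain/bias parameters omitted for brevity

def residual_block(x, sublayer):
    # Add the sub-layer's output back onto its input, then layer-normalize,
    # as done around every self-attention and feed-forward sub-layer.
    return layer_norm(x + sublayer(x))

x = np.random.randn(2, 512)
out = residual_block(x, lambda v: 0.1 * v)   # dummy sub-layer for illustration
print(out.shape)                             # (2, 512)
```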
7. Decoder Side
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
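A schematic sketch of that decoding loop, with hypothetical `decode_step` and `embed` helpers (dummy stand-ins below so the sketch runs); one token is produced per step until the special end-of-sequence symbol appears.

```python
EOS = "<eos>"

def greedy_decode(encoder_output, decode_step, embed, max_len=50):
    """Repeatedly run the decoder stack, feeding each step's output token back
    in, until the special end-of-sequence symbol is produced (or max_len is hit)."""
    output = ["<start>"]
    for _ in range(max_len):
        prefix = embed(output)                       # embed + add positional encodings
        next_token = decode_step(encoder_output, prefix)
        output.append(next_token)
        if next_token == EOS:
            break
    return output[1:]

# Dummy stand-ins so the sketch runs end to end (a real model would supply these).
dummy_embed = lambda tokens: tokens
dummy_decode_step = lambda enc, prefix: "word" if len(prefix) < 3 else EOS
print(greedy_decode(encoder_output=None, decode_step=dummy_decode_step, embed=dummy_embed))
# ['word', 'word', '<eos>']
```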
8. Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. The Softmax layer then turns those scores into probabilities, and the word with the highest probability is produced as the output for this time step.
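A minimal sketch of the final Linear + Softmax step, assuming a toy vocabulary size of 10,000; the projection matrix is a random stand-in for the trained output weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_model, vocab_size = 512, 10000        # toy vocabulary size (assumption)
np.random.seed(0)

decoder_output = np.random.randn(d_model)          # one vector from the decoder stack
W_linear = np.random.randn(d_model, vocab_size)    # random stand-in for trained weights

logits = decoder_output @ W_linear                 # the much larger "logits vector"
probs = softmax(logits)                            # one probability per vocabulary word
predicted_index = int(np.argmax(probs))            # pick the most likely word
print(logits.shape, probs.sum(), predicted_index)
```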
9. Go Forth And Transform
- Watch Łukasz Kaiser’s talk walking through the model and its details
- Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
- Explore the Tensor2Tensor repo.
Follow-up works:
- Depthwise Separable Convolutions for Neural Machine Translation
- One Model To Learn Them All
- Discrete Autoencoders for Sequence Models
- Generating Wikipedia by Summarizing Long Sequences
- Image Transformer
- Training Tips for the Transformer Model
- Self-Attention with Relative Position Representations
- Fast Decoding in Sequence Models using Discrete Latent Variables
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Adapted from:
- https://jalammar.github.io/illustrated-transformer/
- Video introduction: https://www.youtube.com/watch?v=rBCqOTEfxvg
author: Google Brain; Google Research
date: 2017
keyword:
- model
Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems. 2017. Cited by 11535.
Paper: Attention Is All You Need
Summary
- Propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
- The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality.
Methods
- system overview:
【Module One】 Encoder and Decoder Stacks
- Encoder
- composed of a stack of N=6 identical layers
- each layer has a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
- employ a residual connection around each of the two sub-layers, followed by layer normalization.
- Decoder
- composed of a stack of N=6 identical layers;
- the decoder also inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
- modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions; this masking ensures that the predictions for position i can depend only on the known outputs at positions less than i.
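A minimal sketch of that masking: scores for subsequent positions are set to a large negative value before the softmax, so position i only attends to positions up to and including i.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out "subsequent positions": position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

np.random.seed(0)
Q = K = V = np.random.randn(4, 64)       # four decoder positions, d_k = 64
out = masked_self_attention(Q, K, V)
print(out.shape)                         # (4, 64)
```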
【Attention】
- Scaled Dot-Product Attention
$$ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
- Multi-Head Attention
$$ \mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O \\ \mathrm{head}_i=\mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \\ W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k},\quad W^O \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}} $$
【Application of Attention】
- allow every position in the decoder to attend over all positions in the input sequence;
- each position in the encoder can attend to all positions in the previous layer of the encoder;
【Position-wise Feed-Forward Networks】
$$ \mathrm{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2 $$
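A minimal sketch of the position-wise feed-forward network with the paper's dimensions d_model = 512 and d_ff = 2048; weights are random stand-ins for trained parameters.

```python
import numpy as np

d_model, d_ff = 512, 2048
np.random.seed(0)

W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(x):
    # Two linear transformations with a ReLU in between, applied to each
    # position separately and identically.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(3, d_model)    # three positions
print(ffn(x).shape)                # (3, 512)
```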