Seq2Seq Notes


These are notes on the classic Seq2Seq paper, Sequence to Sequence Learning with Neural Networks, together with takeaways from my own code experiments.


The idea originated in machine translation (I'm not sure how it plays out in other domains, e.g. chatbots). Consider the following scenario (borrowed from the original paper). You want to translate:

A B C -> alpha beta gamma delta

In this setting, the encoder has to read the full source sequence (A B C) before it starts predicting alpha, by which point it may have forgotten about A. But if you reverse the source as,

C B A -> alpha beta gamma delta

You have a strong communication link from A to alpha, where A is “probably” related to alpha in the translation.

Note: This entirely depends on your translation task. If the target language is written in the reverse order of the source language (e.g. think of translating from a subject-verb-object language to an object-verb-subject language), I think it's better to keep the original order.
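The trick itself is tiny: only the source side is reversed, before padding. A minimal sketch (the helper name `reverse_source` is mine, not from the paper):

```python
def reverse_source(src_tokens):
    """Reverse a source token sequence, as in the original Seq2Seq paper.

    A B C -> C B A, so the last source token fed in sits closest to the
    first target token the decoder must predict.
    """
    return list(reversed(src_tokens))

print(reverse_source(["A", "B", "C"]))  # ['C', 'B', 'A']
```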


At each decoding step, the decoder input has shape [batch_size, 1, embedding_dim], i.e. max_seq_len is 1.
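That shape falls out of feeding the decoder one token per sequence per step. A sketch with a hypothetical GRU decoder (`batch_first=True`; all sizes here are made-up examples):

```python
import torch
import torch.nn as nn

batch_size, vocab_size, embedding_dim, hidden_dim = 4, 100, 32, 64

embedding = nn.Embedding(vocab_size, embedding_dim)
gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

# One decoding step: a single token id per sequence in the batch.
token = torch.randint(0, vocab_size, (batch_size, 1))  # [batch_size, 1]
emb = embedding(token)                # [batch_size, 1, embedding_dim]
hidden = torch.zeros(1, batch_size, hidden_dim)  # e.g. the encoder's final state
out, hidden = gru(emb, hidden)        # out: [batch_size, 1, hidden_dim]

print(emb.shape)  # torch.Size([4, 1, 32])
print(out.shape)  # torch.Size([4, 1, 64])
```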

Teacher forcing

The words in the decoder are always generated one after another, one per time-step. We always use `<sos>` for the first input to the decoder, $y_1$; for subsequent inputs, $y_{t>1}$, we sometimes use the actual ground-truth next word in the sequence, $y_t$, and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called teacher forcing.
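The mechanism above can be sketched as a decode loop with a `teacher_forcing_ratio` flipping between ground truth and the model's own prediction. All names (`SOS_IDX`, `trg`, the layer sizes) are illustrative, not from a specific library:

```python
import random
import torch
import torch.nn as nn

SOS_IDX = 2  # assumed index of the <sos> token
batch_size, trg_len, vocab_size = 4, 6, 100
embedding_dim, hidden_dim = 32, 64

embedding = nn.Embedding(vocab_size, embedding_dim)
gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
fc_out = nn.Linear(hidden_dim, vocab_size)

trg = torch.randint(0, vocab_size, (batch_size, trg_len))  # ground-truth targets
hidden = torch.zeros(1, batch_size, hidden_dim)            # stand-in encoder state
teacher_forcing_ratio = 0.5

inp = torch.full((batch_size, 1), SOS_IDX)  # y_1 is always <sos>
outputs = []
for t in range(1, trg_len):
    out, hidden = gru(embedding(inp), hidden)
    logits = fc_out(out.squeeze(1))  # [batch_size, vocab_size]
    outputs.append(logits)
    if random.random() < teacher_forcing_ratio:
        inp = trg[:, t].unsqueeze(1)          # feed ground truth y_t
    else:
        inp = logits.argmax(1, keepdim=True)  # feed prediction y_hat_{t-1}

print(len(outputs), outputs[0].shape)  # 5 torch.Size([4, 100])
```

Setting `teacher_forcing_ratio` to 1.0 recovers pure teacher forcing; 0.0 makes the decoder always consume its own predictions, as at inference time.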


A padded target batch looks like this (the trailing 1s are padding):

tensor([[ 2, 11, 4, 13, 3, 1, 1, 1, 1, 1, 1, 1],
[ 2, 11, 11, 7, 10, 9, 12, 13, 5, 3, 1, 1],
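Batches like the one above are typically built by padding variable-length sequences to a common length. A sketch using `torch.nn.utils.rnn.pad_sequence`, assuming (from the values shown) that 1 is the pad index:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed padding index

# Two sequences of different lengths, e.g. <sos> ... <eos>.
seqs = [torch.tensor([2, 11, 4, 13, 3]),
        torch.tensor([2, 11, 11, 7, 10, 9, 12, 13, 5, 3])]

batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
print(batch)
# tensor([[ 2, 11,  4, 13,  3,  1,  1,  1,  1,  1],
#         [ 2, 11, 11,  7, 10,  9, 12, 13,  5,  3]])
```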