This post contains my notes on the classic Seq2Seq paper, Sequence to Sequence Learning with Neural Networks, together with takeaways from my own code experiments.
The idea originated in machine translation (I’m not sure how well it carries over to other domains, e.g. chatbots). Consider the following scenario (borrowed from the original paper). You want to translate,
A B C -> alpha beta gamma delta
In this setting, we have to go through the full source sequence (A B C) before starting to predict alpha, by which point the translator might have forgotten about A. But if you instead do,
C B A -> alpha beta gamma delta
You have a strong communication link from A to alpha, where A is “probably” related to alpha in the translation.
Note: This entirely depends on your translation task. If the target language is written in the reverse order of the source language (e.g. think of translating from a subject-verb-object language to an object-verb-subject language), I think it’s better to keep the original order, since the last source word is then already adjacent to the first target word.
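The reversal trick itself is trivial to apply as a preprocessing step. A minimal sketch (the function name is mine, not from the paper):

```python
def reverse_source(tokens):
    """Reverse the source token order (C B A instead of A B C), as the
    paper suggests, to shorten the path between early source words and
    early target words. The target side is left untouched."""
    return list(reversed(tokens))

print(reverse_source(["A", "B", "C"]))  # -> ['C', 'B', 'A']
```

In practice you would apply this per example before numericalizing and padding, so that padding still ends up at the end of each sequence.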
The words in the decoder are always generated one after another, one per time-step, so the decoder input at each step has shape [batch_size, 1, embedding_dim], i.e. max_seq_len is 1.
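This one-token-at-a-time loop can be sketched framework-free. Here `step_fn` is a hypothetical stand-in for the real decoder forward pass (which would take a [batch_size, 1, embedding_dim] input), and the special-token ids are assumptions:

```python
SOS, EOS, PAD = 2, 3, 1  # assumed special-token ids

def greedy_decode(step_fn, init_state, max_len=12):
    """Greedy decoding sketch: feed one token per time-step, starting
    from SOS, and feed each prediction back in as the next input.
    `step_fn(token, state) -> (next_token, state)` stands in for a real
    decoder forward pass plus an argmax over the vocabulary."""
    token, state = SOS, init_state
    out = []
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == EOS:
            break
        out.append(token)
    return out

# Toy step function that emits a fixed sequence and then EOS.
script = iter([11, 4, 13, 3])
print(greedy_decode(lambda tok, st: (next(script), st), None))  # -> [11, 4, 13]
```

A real implementation would also stop early per batch element and could be swapped for beam search; the control flow stays the same.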
tensor([[ 2, 11, 4, 13, 3, 1, 1, 1, 1, 1, 1, 1],
[ 2, 11, 11, 7, 10, 9, 12, 13, 5, 3, 1, 1],
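Reading the ids off this tensor, 2/3/1 appear to be the sos/eos/pad tokens (an assumption on my part). Batches like this can be built with a small padding helper:

```python
SOS, EOS, PAD = 2, 3, 1  # assumed ids, read off the tensor above

def pad_batch(seqs, max_len):
    """Wrap each token-id sequence with SOS/EOS and right-pad with PAD
    so every row in the batch has length max_len."""
    return [[SOS] + s + [EOS] + [PAD] * (max_len - len(s) - 2) for s in seqs]

batch = pad_batch([[11, 4, 13], [11, 11, 7, 10, 9, 12, 13, 5]], max_len=12)
for row in batch:
    print(row)
# -> [2, 11, 4, 13, 3, 1, 1, 1, 1, 1, 1, 1]
# -> [2, 11, 11, 7, 10, 9, 12, 13, 5, 3, 1, 1]
```

The two printed rows reproduce the first two rows of the tensor above; in PyTorch the same effect is usually achieved with `torch.nn.utils.rnn.pad_sequence`.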