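Position encoding is what lets a Transformer model its input as an ordered sequence. This article gives a systematic account of how position encoding has evolved in Transformer architectures, weighs the strengths and weaknesses of each scheme, and derives the concrete implementation of each one by hand.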

Encoding type	Representative models
Learned Positional Embedding	BERT, RoBERTa, GPT, GPT-2
Sinusoidal	Transformer, Transformer-XL
RoPE	RoFormer, GPT-NeoX, LLaMA, ChatGLM, Baichuan-7B
ALiBi	BLOOM, BloombergGPT, Baichuan-13B

Learned Positional Embedding

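Method

Encoding position with a learnable positional embedding is the most widely used approach in pretrained language models. BERT works this way, and so do follow-up models such as RoBERTa and GPT-2.

The scheme randomly initializes one position embedding per position and adds it to the word embedding before the sum is fed to the model. The position embeddings are ordinary model parameters and are updated during training, as shown in the figure below: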

Implementation

This scheme is the simplest to implement: initialize one learnable position embedding of shape (maximum encodable length, hidden size).

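# Initialization (inside the embedding module's __init__)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)

# Forward pass: look up each position id and add the result to the token embeddings
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings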

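Analysis

Visualizing BERT's position embeddings gives the following picture:

The first and last positions, 0 and 511, sit apart in a corner of their own. The remaining positions (1 to 510) join into one continuous curve. This shows that every position is distinguishable and that adjacent positions lie close together.

The drawback of this scheme is just as clear: it cannot extrapolate. Once the maximum length has been chosen it is fixed, and positions beyond it have no embedding.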
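The lack of extrapolation can be seen in a toy lookup table. The sketch below is plain Python with no framework, so the rows are only randomly initialized and never trained; the class name `LearnedPositionTable` and its methods are hypothetical and do not come from any library.

```python
import random


class LearnedPositionTable:
    """Toy stand-in for a learned position embedding (hypothetical name).

    One row per position up to max_len. In a real model the rows would be
    trainable parameters; here they are only randomly initialized.
    """

    def __init__(self, max_len, hidden_size, seed=0):
        rng = random.Random(seed)
        self.max_len = max_len
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(max_len)
        ]

    def add_to(self, word_embeddings):
        """Add the row for position p to the p-th word embedding."""
        if len(word_embeddings) > self.max_len:
            raise IndexError(
                f"sequence length {len(word_embeddings)} exceeds max_len {self.max_len}"
            )
        return [
            [w + p for w, p in zip(word_vec, self.table[pos])]
            for pos, word_vec in enumerate(word_embeddings)
        ]


pos_table = LearnedPositionTable(max_len=4, hidden_size=3)
short_input = [[0.0, 0.0, 0.0] for _ in range(4)]  # fits within max_len
long_input = [[0.0, 0.0, 0.0] for _ in range(5)]   # one token too long

encoded = pos_table.add_to(short_input)
try:
    pos_table.add_to(long_input)
    extrapolates = True
except IndexError:
    extrapolates = False
```

The fifth token has no row in the table, so the only options are to fail, truncate the input, or retrain with a larger table.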

Sinusoidal

Method

Sinusoidal position encoding is the scheme used in the original Transformer paper (Attention Is All You Need). It is computed as follows:
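The formula is not reproduced in this text. For reference, the definition given in the original paper is shown here, where $pos$ is the position, $i$ indexes a sine/cosine pair and $d_{model}$ is the embedding size:

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The constant 10000 corresponds to `max_period` in the implementation further down and to $N$ in the period analysis.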

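The sin and cos functions directly give the absolute value of every dimension at every position, so this amounts to initializing the position embedding with fixed values and never training it.

The authors compared this with a learned position embedding and found that the two perform about equally well. They chose the sinusoidal version because it may extrapolate better, since it needs no preset maximum length max_position.

Implementation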

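import torch


class SinusoidalPositionEncoding(torch.nn.Module):
    def __init__(self, dim, max_period=10000):
        super().__init__()
        assert dim % 2 == 0, "dim must be even"
        self.dim = dim
        self.max_period = max_period

        # Angular frequencies w_j = 1 / max_period^(2j/dim), j = 0..dim/2-1
        w_arr = torch.arange(0, self.dim // 2)
        w_arr = 1 / (max_period) ** (w_arr * 2 / dim)

        # Buffer: follows the module across devices but is not trained
        self.register_buffer("w_arr", w_arr)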

    def forward(self, x):
        """
        assume x has shape (B,T) where B=batch size, T=token size(or sequence length)
        and values of x are integers >=0.
        """

        _x = torch.unsqueeze(x, -1)  # (B,T,1)

        v = _x * self.w_arr  # (B,T,dim//2)

        sin = torch.sin(v)
        sin = torch.unsqueeze(sin, -1)  # (B,T,m,1)

        cos = torch.cos(v)
        cos = torch.unsqueeze(cos, -1)  # (B,T,m,1)

        m = torch.cat([sin, cos], -1)  # (B,T,m,2)

        b, t, _, _ = m.shape

        y = m.reshape(b, t, -1)  # (B,T,dim) where 2m=`dim`

        return y

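Analysis

Comparison with learned embeddings

The original Transformer used sinusoidal encoding, yet BERT and the pretrained models that followed adopted a learned position embedding. Possible reasons include:

A learned position embedding is somewhat simpler to implement.
BERT was trained on much more data, about 40 times as much as the Transformer, and results may differ at that scale.
The Transformer targeted machine translation, where the encoder's main job is to extract the meaning of the whole sentence and the exact position of each word matters less. An NLU model like BERT does need it, especially for downstream tasks such as sequence labeling that require finer-grained position modeling. In that setting a fully trained embedding works better than directly assigned values.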

Period

Looking at the periods, dimension i has period $N^{i/d} \cdot 2\pi$ with $0 \le i < d$, so the periods range from $2\pi$ up to almost $N \cdot 2\pi$.
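As a quick numeric check of that range, the sketch below evaluates the period formula from the text for d = 500 and N = 10000. The helper name `period` is hypothetical.

```python
import math


def period(i, d, n=10000):
    """Period of the sinusoid at dimension index i, using the text's N^(i/d) * 2*pi."""
    return (n ** (i / d)) * 2 * math.pi


d = 500
shortest = period(0, d)      # dimension 0: exactly 2*pi
longest = period(d - 1, d)   # last dimension: just under N * 2*pi

# sin(pos / N^(i/d)) repeats after one full period.
i = 100
w = 1.0 / (10000 ** (i / d))
pos = 3.0
repeat_error = abs(math.sin(w * pos) - math.sin(w * (pos + period(i, d))))
```

The longest period stays strictly below $N \cdot 2\pi$ because the exponent $i/d$ never reaches 1.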

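Fixing the dimension d at 500 and plotting the position embedding for different values of N gives the following:

As N grows, the periods become visibly longer. The paper uses N = 10000 without giving a specific reason. A plausible guess is that a large N makes the longest period very large, so that positions are easier to tell apart.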

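Relative position

Sinusoidal encoding exposes relative position: for a fixed offset k, PE(i+k) can be written as a linear function of PE(i). The proof is as follows:

Expanding with the angle-addition identities, where u and v are constants that depend only on k, shows that PE(i+k) is a linear function of PE(i).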
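The expansion can be written out for a single frequency $w_j$. The original derivation is not reproduced in this text, so the assignment $u = \cos(w_j k)$, $v = \sin(w_j k)$ below is an assumption about which constants the text means:

$$
\begin{pmatrix} \sin(w_j (i+k)) \\ \cos(w_j (i+k)) \end{pmatrix}
=
\begin{pmatrix} u & v \\ -v & u \end{pmatrix}
\begin{pmatrix} \sin(w_j i) \\ \cos(w_j i) \end{pmatrix},
\qquad u = \cos(w_j k), \quad v = \sin(w_j k)
$$

The matrix contains only k and $w_j$, never i, so the same linear map carries every position to the position k steps later.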

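Dot product

The core operation in attention is the dot product. Consider the dot product of two position vectors, $PE(i+k)^{T}PE(i)$:

Expanding with trigonometric identities, the result is a value that depends on k alone and not on i. The dot product of two position vectors therefore depends only on the offset k and can represent the relative distance between the two positions.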
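Written out term by term, with $w_j$ denoting the frequency of the j-th sine/cosine pair (a symbol introduced here for illustration), the expansion is:

$$
PE(i+k)^{T} PE(i)
= \sum_{j=0}^{d/2-1} \left[ \sin(w_j (i+k)) \sin(w_j i) + \cos(w_j (i+k)) \cos(w_j i) \right]
= \sum_{j=0}^{d/2-1} \cos(w_j k)
$$

The last step uses $\cos(a - b) = \cos a \cos b + \sin a \sin b$, and the position i cancels out.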

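It follows directly that $PE(t+k)^{T}PE(t) = PE(t)^{T}PE(t-k)$, so sinusoidal encoding is symmetric: the dot product cannot tell whether the other position lies k steps ahead or k steps behind.

It can also be seen that the dot product tends to shrink as k grows, i.e. there is long-range decay, as shown below: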
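These properties are easy to check numerically. The sketch below uses plain Python and an interleaved sin/cos layout; the helper names `sinusoidal_pe` and `dot` are hypothetical.

```python
import math


def sinusoidal_pe(pos, dim, max_period=10000):
    """Interleaved [sin, cos, sin, cos, ...] encoding of one position."""
    vec = []
    for j in range(dim // 2):
        w = 1.0 / (max_period ** (2 * j / dim))
        vec.extend([math.sin(pos * w), math.cos(pos * w)])
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


dim, k = 64, 7
# Same offset k at two different absolute positions.
near = dot(sinusoidal_pe(3 + k, dim), sinusoidal_pe(3, dim))
far = dot(sinusoidal_pe(200 + k, dim), sinusoidal_pe(200, dim))
# Offsets +k and -k from the same position.
ahead = dot(sinusoidal_pe(50, dim), sinusoidal_pe(50 + k, dim))
behind = dot(sinusoidal_pe(50, dim), sinusoidal_pe(50 - k, dim))
# Offset 0 gives the maximum value dim / 2.
self_dot = dot(sinusoidal_pe(50, dim), sinusoidal_pe(50, dim))
```

Here `near` and `far` agree up to floating-point error, as do `ahead` and `behind`, and both are smaller than `self_dot`.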

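In a real attention computation, however, the position vectors are first multiplied by the attention weight matrices, giving $PE(t)^{T}W_{Q}^{T}W_{K}PE(t+k)$, and the result no longer reflects distance, as shown below:

For details, see the analysis of sinusoidal encoding by the Fudan University group [1].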

RoPE

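RoPE, the rotary position embedding, was proposed by Jianlin Su in RoFormer [2] and is now widely adopted in large models. It is applied inside the attention computation rather than at the embedding input layer.

The position weight $R_{m}$ for position m is a matrix built from 2x2 rotation blocks, as shown below:

Multiplying $R_{m}$ by the word vector x at position m injects the position information into x. Thanks to the structure of $R_{m}$, the matrix product can be rewritten as the sum of two element-wise products:

Viewed two dimensions at a time, this rotates each adjacent pair (x1, x2), which is why the scheme is called rotary encoding: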
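The formulas referred to above are not reproduced in this text. As a reference, the form given in the RoFormer paper is sketched here, with $\theta_i$ the frequency of the i-th pair. For one pair of dimensions:

$$
\begin{pmatrix} x_1' \\ x_2' \end{pmatrix}
=
\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
$$

For the full vector, the block-diagonal product becomes two element-wise products ($\odot$):

$$
R_{m} x =
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \end{pmatrix}
\odot
\begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \end{pmatrix}
+
\begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \end{pmatrix}
\odot
\begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \end{pmatrix}
$$

The first two rows of the second formula reproduce the 2x2 rotation exactly: $x_1' = x_1 \cos m\theta_1 - x_2 \sin m\theta_1$ and $x_2' = x_2 \cos m\theta_1 + x_1 \sin m\theta_1$.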

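The RoPE self-attention procedure is therefore as follows. For each token's embedding, first compute its query and key vectors. Then compute the rotation matrix for each token position, and rotate the elements of that position's query and key vectors pair by pair. Finally, take the inner product of query and key to obtain the self-attention scores.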
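The procedure can be sketched for a single query and key in plain Python. This is a minimal illustration, not the Hugging Face code: it assumes the adjacent-pair layout of the paper and the frequency rule $\theta_i = base^{-2i/d}$, and the helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec, position, base=10000):
    """Rotate adjacent pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by position * theta_i."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(dim // 2):
        theta = base ** (-2 * i / dim)  # frequency of the i-th pair
        angle = position * theta
        x1, x2 = vec[2 * i], vec[2 * i + 1]
        out.append(x1 * math.cos(angle) - x2 * math.sin(angle))
        out.append(x1 * math.sin(angle) + x2 * math.cos(angle))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.8, 0.5]
key = [1.1, 0.4, -0.7, 0.9]

# Same relative offset (m - n = 3) at two different absolute positions.
score_a = dot(rope(query, 5), rope(key, 2))
score_b = dot(rope(query, 105), rope(key, 102))
```

The two scores agree up to floating-point error, because only the offset between the positions enters the inner product. A rotation also leaves the length of each vector unchanged.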

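Implementation

Related discussion: https://discuss.huggingface.co/t/is-llama-rotary-embedding-implementation-correct/44509

The LLaMA implementation in Hugging Face Transformers, which caches the cos and sin tables for speed, can be used as a direct reference: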

class LlamaRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        # theta_i: one inverse frequency per pair of dimensions
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
        # Outer product of positions and frequencies: freqs[m, i] = m * theta_i
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

# Inside the attention layer
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

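Analysis

Motivation

In Su's own account, the motivation comes from a property of Euler's formula, shown below: multiplying two such terms yields an expression in their difference, which suggests a way to realize relative position encoding.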

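Derivation

Start with just 2 dimensions. The token at position m is then a 2-dimensional vector q, which can be treated as a number in the complex plane:

Next define a function f that adds position information to the query. Here it simply multiplies by $e^{im\theta}$:

This is equivalent to rotating q by the angle $m\theta$.

It can further be shown that, for two different positions m and n, the inner product after position encoding depends only on the relative position (m-n):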
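The steps above can be written out as follows. The original formulas are not reproduced in this text, so the notation is an assumption: q and k are the 2-dimensional query and key read as complex numbers, and $\bar{k}$ is the complex conjugate of k.

$$
q = q_1 + i q_2, \qquad f(q, m) = q\, e^{im\theta}, \qquad f(k, n) = k\, e^{in\theta}
$$

$$
\langle f(q, m), f(k, n) \rangle
= \mathrm{Re}\left[ f(q, m)\, \overline{f(k, n)} \right]
= \mathrm{Re}\left[ q\, \bar{k}\, e^{i(m-n)\theta} \right]
$$

The first equality is the usual identity between the real inner product of two 2-dimensional vectors and the real part of one complex number times the conjugate of the other. The positions m and n appear only through the difference m - n.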

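To extend this to d dimensions, group the dimensions in pairs and give each pair its own $\theta$.

It can also be shown that the dot product tends to shrink as the relative distance grows, i.e. it exhibits long-range decay.

Comparison with Sinusoidal

Following EleutherAI's discussion, RoPE differs from sinusoidal encoding in two ways: it acts on pairs of coordinates rather than on single coordinates, and it is multiplicative rather than additive, which fits the attention computation better.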

ALiBi

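Attention with Linear Biases (ALiBi) [3] is a position encoding designed specifically to improve the extrapolation ability of Transformer models: the model is trained with a fairly short context yet can handle a much longer one at inference time.

Concretely, a bias is added to the attention scores after the query-key dot product and before the softmax. The bias is fixed rather than learned and depends only on relative position, as shown below: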
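The formula is not reproduced in this text. For reference, the causal form given in the ALiBi paper for the i-th query $q_i$, keys $K$ and head-specific slope $m$ is:

$$
\mathrm{softmax}\left( q_i K^{T} + m \cdot \left[ -(i-1), \ldots, -2, -1, 0 \right] \right)
$$

The rightmost entry belongs to the current token itself, which receives no penalty. Each step further back subtracts another $m$ from the score.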

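The attention scores are "penalized" according to distance: the larger the distance, the larger the penalty. Each attention head uses a different penalty slope m. With n heads, the slopes form a geometric sequence that starts at $2^{-8/n}$ and ends at $2^{-8}$. This is an empirical choice the authors arrived at through experiments.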
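The slope schedule and the resulting bias can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the number of heads is a power of two and that attention is causal; the helper names `alibi_slopes` and `alibi_bias` are hypothetical, and the masking of future positions with negative infinity is an assumption added for the example.

```python
def alibi_slopes(num_heads):
    """Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads."""
    ratio = 2 ** (-8 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix: entry [i][j] is -slope * (i - j) for j <= i.

    Future positions (j > i) are masked with -inf (an assumption of this sketch).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)                 # 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], seq_len=4)  # bias for the first head
```

For 8 heads the slopes are 1/2, 1/4, ..., 1/256, so the first head penalizes distance most strongly and the last head barely at all.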

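The approach is also very cheap in compute and memory, because no position encoding has to be applied to the queries and keys. A bias is simply added after the inner product.

Implementation

The BLOOM implementation in Hugging Face Transformers can be used as a reference: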

def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
    """
    Link to paper: https://arxiv.org/abs/2108.12409 Alibi tensor is not causal as the original paper mentions, it
    relies on a translation invariance of softmax for quick implementation: with l being a tensor, and a fixed value
    `softmax(l+a) = softmax(l)`. Based on
    https://github.com/ofirpress/attention_with_linear_biases/blob/a35aaca144e0eb6b789dfcb46784c4b8e31b7983/fairseq/models/transformer.py#L742
    TODO @thomasw21 this doesn't work as nicely due to the masking strategy, and so masking varies slightly.

    Args:
    Returns tensor shaped (batch_size * num_heads, 1, max_seq_len)
        attention_mask (`torch.Tensor`):
            Token-wise attention mask, this should be of shape (batch_size, max_seq_len).
        num_heads (`int`, *required*):
            number of heads
        dtype (`torch.dtype`, *optional*, default=`torch.bfloat16`):
            dtype of the output tensor
    """
    batch_size, seq_length = attention_mask.shape
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    base = torch.tensor(
        2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
    )
    powers = torch.arange(1, 1 + closest_power_of_2, device=attention_mask.device, dtype=torch.int32)
    slopes = torch.pow(base, powers)

    if closest_power_of_2 != num_heads:
        extra_base = torch.tensor(
            2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
        )
        num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
        extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=attention_mask.device, dtype=torch.int32)
        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)

    # Note: alibi will added to the attention bias that will be applied to the query, key product of attention
    # => therefore alibi will have to be of shape (batch_size, num_heads, query_length, key_length)
    # => here we set (batch_size=1, num_heads=num_heads, query_length=1, key_length=max_length)
    # => the query_length dimension will then be broadcasted correctly
    # This is more or less identical to T5's relative position bias:
    # https://github.com/huggingface/transformers/blob/f681437203baa7671de3174b0fa583c349d9d5e1/src/transformers/models/t5/modeling_t5.py#L527
    arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[:, None, :]
    alibi = slopes[..., None] * arange_tensor
    return alibi.reshape(batch_size * num_heads, 1, seq_length).to(dtype)

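Analysis

Extrapolation

With the maximum training length fixed at 512 or 1024, ALiBi extrapolates very stably at inference time, as shown below:

Discussion

If one were designing a large model from scratch, there is still no settled best practice for which position encoding to use. Rotary and ALiBi both currently look worth trying.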

References

[1] TENER: Adapting Transformer Encoder for Named Entity Recognition
[2] RoFormer: Enhanced Transformer with Rotary Position Embedding
[3] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
[4] (Zhihu) Why doesn't BERT use Sinusoidal: https://www.zhihu.com/question/307293465


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.
