
“Attention Is All You Need” Paper Analysis (中英文对照精读分析)


张小明

前端开发工程师


Introduction (引言)

English: The paper “Attention Is All You Need” introduced the Transformer architecture, a novel sequence transduction model that relies solely on attention mechanisms, eliminating recurrence and convolution entirely[1]. This model consists of an encoder-decoder structure built from self-attention and feed-forward sub-layers, enabling highly parallel computation and superior modeling of long-range dependencies. On machine translation benchmarks (WMT 2014 English-German and English-French), the Transformer achieved state-of-the-art results (28.4 BLEU for EN–DE, 41.0 BLEU for EN–FR) with significantly less training cost and time than previous RNN/CNN-based models[2]. This analysis provides a detailed bilingual examination of the Transformer's architecture and components, experimental setup and results, advantages over earlier models, its influence on later models such as BERT/GPT/T5, and an explanation of the key mathematical formulations in the paper.

中文:《Attention Is All You Need》论文提出了Transformer架构,这是一种全新的序列转换模型,完全基于注意力机制,彻底摒弃了循环神经网络和卷积网络[1]。该模型采用编码器-解码器结构,由自注意力和前馈子层堆叠而成,因而能够高度并行计算,并且能够出色地建模长距离依赖关系。在机器翻译基准测试(WMT 2014英德和英法翻译任务)上,Transformer模型取得了当时的最新性能(英德28.4 BLEU分、英法41.0 BLEU分),同时训练成本和时间远低于以往基于RNN/CNN的模型[2]。本文将对Transformer的模型结构与各组件、实验设置与结果、相对于传统模型的优势、对后续模型(如BERT/GPT/T5)的影响,以及论文中关键数学公式的详细解释进行中英文对照的深入分析。

1. Model Architecture (模型结构详解)

Figure: The Transformer model architecture, consisting of an encoder (left) and a decoder (right). Each encoder layer (repeated $N=6$ times) has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, each wrapped with a residual connection and layer normalization (“Add & Norm”). The decoder stack likewise has $N=6$ layers, each with an additional sub-layer for encoder-decoder attention, and uses masked self-attention in its first sub-layer to preserve autoregressive decoding.[3][4]

English: The Transformer is a sequence-to-sequence model built with a stack of identical layers in the encoder and decoder (six each in the original model)[4]. Each encoder layer has two core sub-layers: first a multi-head self-attention mechanism, and second a position-wise feed-forward network, with each sub-layer followed by a residual addition and layer normalization (often denoted as an “Add & Norm” step)[3]. The decoder layers have a similar structure with two differences: (1) each decoder layer includes a third sub-layer that performs multi-head attention over the encoder's output (encoder-decoder attention), and (2) the self-attention sub-layer in the decoder is masked to prevent attending to future positions, thereby ensuring the decoder's outputs are generated autoregressively (each position can only depend on earlier positions)[4]. Additionally, the model uses learned input embeddings (for source and target tokens) of dimension $d_{\text{model}}$, adds positional encoding to those embeddings to inject sequence order information, and employs a final linear projection plus softmax layer to produce output token probabilities[5][6].

中文:Transformer是一个编码器-解码器架构的序列到序列模型,原始模型中编码器和解码器各由6层相同的基本层堆叠而成[4]。每个编码器层包含两个核心子层:首先是多头自注意力机制,其次是逐位置前馈神经网络,并在每个子层后面应用残差连接和层归一化(通常称为“Add & Norm”步骤)[3]。解码器层的结构与编码器类似,但有两点差异:(1) 每个解码器层比编码器多一个第三子层,对编码器输出执行多头注意力(即编码器-解码器注意力);(2) 解码器中的自注意力子层使用掩膜机制,以避免关注将来的位置,确保解码器的输出按自回归方式生成(当前位置只能依赖之前的位置)[4]。此外,模型使用学习得到的输入嵌入(源语言和目标语言的词向量表示),将位置编码加入这些嵌入以注入序列次序信息,并在输出端通过线性变换和softmax层将解码器的输出映射为各个目标词的概率分布[5][6]。
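The decoder's masked self-attention described above can be made concrete with a small sketch. This is an illustrative toy (not the paper's code): it builds the causal mask for a length-5 sequence and shows that, after the softmax, each position places zero weight on every future position.

```python
import numpy as np

# Illustrative sketch (not from the paper's code): the decoder's causal mask
# for a length-5 sequence. Position i may attend only to positions <= i;
# masked entries are set to -inf before the softmax, so they get zero weight.
n = 5
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
scores = np.zeros((n, n))                         # stand-in attention scores (all equal)
scores[mask] = -np.inf                            # block future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax

# With equal scores, each row i is uniform over positions 0..i and zero afterwards.
assert np.allclose(weights[2, :3], 1 / 3) and weights[2, 3] == 0.0
```

In a real implementation the same `-inf` masking is applied to the $QK^T$ scores inside every decoder self-attention layer, which is what preserves the autoregressive property during training.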

Multi-Head Attention Mechanism (多头注意力机制)

English: Multi-Head Attention is the central mechanism that allows the Transformer to attend to information from different representation subspaces and different positions in parallel[7][8]. Instead of performing a single attention with full dimensionality, the Transformer uses $h$ attention “heads” simultaneously[7]. For each head, the input queries $Q$, keys $K$, and values $V$ (each of dimension $d_{\text{model}}$) are linearly projected into a lower-dimensional subspace ($d_k$ for queries/keys, $d_v$ for values) using learned projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ specific to each head[7]. Then each head independently computes scaled dot-product attention (described below) on these projected $Q_i, K_i, V_i$. The $h$ results (each $d_v$-dimensional) are then concatenated and projected again with another matrix $W^O$ to produce the final output of the multi-head attention layer[9]. This design allows each head to focus on different patterns or aspects of the input simultaneously[10]; for example, one head might attend to syntactic relations while another focuses on long-distance dependencies. By using multiple parallel attention heads, the model can jointly attend to information at different positions and from different representation subspaces, which a single-head attention would mix into one representation[8][10]. In the original Transformer, $h=8$ heads are used; each head has dimensionality $d_k = d_v = 64$, since $d_{\text{model}}=512$ and $512/8 = 64$. This means the total computational cost is similar to that of a single-head attention with full dimensionality, but with the benefit of diversity from multiple heads[11][9].

中文:多头注意力机制是Transformer的核心机制,使模型能够并行地从不同的表示子空间和不同位置获取信息[7][8]。Transformer并非用完整维度执行单一注意力,而是同时使用$h$个注意力“头”[7]。对于每一个头,首先将输入的查询矩阵$Q$、键矩阵$K$和值矩阵$V$(每项维度为$d_{\text{model}}$)通过各自的线性投影矩阵$W_i^Q$、$W_i^K$、$W_i^V$投影到低维子空间(查询/键投影维度为$d_k$,值投影维度为$d_v$)[7]。然后,每个头在投影后的$Q_i, K_i, V_i$上独立地计算缩放点积注意力(见下文对该数学形式的解释)。计算得到$h$个头各自的输出(每个为$d_v$维向量)后,将这些输出拼接(concat),再通过另一个投影矩阵$W^O$映射,产生多头注意力层的最终输出[9]。这种设计使每个注意力头可以同时关注输入中的不同模式或方面[10]——例如,一个头可以侧重于句法关系,另一个头关注远距离依赖。借助多个并行的注意力头,模型可以联合关注不同位置和不同表示子空间的信息,而单头注意力会把所有信息混合进单一表示中[8][10]。原始Transformer使用$h=8$个头;每个头的维度取$ d_k = d_v = 64$,因为$d_{\text{model}}=512$且$512/8 = 64$。这意味着总的计算成本与使用完整维度的单头注意力相当,但通过多个头的协同可以带来表示多样性和更强的表达能力[11][9]。

Positional Encoding (位置编码)

English: Because the Transformer has no recurrent or convolutional structure, it cannot natively infer the order of sequence elements; thus the model uses Positional Encoding to inject information about token positions[12]. The positional encodings are vectors of the same dimension $d_{\text{model}}$ as the embeddings, so that they can be added directly to the token embeddings at the bottoms of the encoder and decoder stacks[12]. In Vaswani et al.'s implementation, these encodings are deterministic sinusoidal functions of the position index: each position $pos$ is mapped to a vector whose $2i$-th dimension is $\sin(pos/10000^{2i/d_{\text{model}}})$ and whose $(2i+1)$-th dimension is $\cos(pos/10000^{2i/d_{\text{model}}})$[13]. This means each embedding dimension corresponds to a sinusoid of a different frequency (wavelengths ranging from $2\pi$ to $10000\cdot 2\pi$)[14]. The rationale for this design was to allow the model to learn relative positional relationships easily: for any fixed offset $k$, the positional encoding of $pos+k$ can be represented as a linear function of the encoding of $pos$, enabling the model to potentially generalize to sequence lengths longer than those seen in training[15]. Notably, the authors found that using learned positional embeddings yielded nearly identical results to these sinusoidal encodings[16]. They chose the sinusoidal version for its simplicity and the hoped-for ability to extrapolate to longer sequences not seen during training[16].

中文:由于Transformer没有循环或卷积结构,本身无法获知序列元素的顺序信息,因此模型使用位置编码来注入关于序列位置的信息[12]。位置编码是与嵌入向量维度相同(均为$d_{\text{model}}$)的向量,能够直接加到编码器和解码器底层的词嵌入上[12]。在Vaswani等人的实现中,位置编码采用确定性的正弦和余弦函数来表示位置索引:对于序列位置$pos$,其编码向量的第$2i$维定义为$\sin(pos/10000^{2i/d_{model}})$,第$2i+1$维定义为$\cos(pos/10000^{2i/d_{model}})$[13]。也就是说,位置编码的每个维度对应于不同频率的正弦波(波长从$2\pi$到$10000\cdot 2\pi$按几何级数增长)[14]。这样设计的理由是使模型能够轻松学习相对位置关系——对于任意固定的偏移$k$,位置$pos+k$的编码可以表示为位置$pos$编码的线性函数,从而让模型有能力泛化到训练时未见过的更长序列[15]。作者还发现,使用可学习的位置嵌入与这种正弦位置编码的效果几乎相同[16]。他们最终选择正弦函数版本,主要是由于其实现简单,而且有望使模型能够外推到训练过程中未出现的更长序列[16]。

Position-Wise Feed-Forward Network (逐位置前馈网络)

English: In addition to the attention sub-layers, each layer in both the encoder and decoder contains a position-wise feed-forward network (FFN) that further transforms each position's representation[17]. This FFN is applied identically and independently to each position (hence “position-wise”), meaning it does not mix information across different sequence positions. It consists of two linear transformations with a ReLU activation in between[18]. Formally, for an input vector $x$ at a given position, the feed-forward sub-layer computes: $\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$[19]. The first linear layer expands the dimensionality from $d_{\text{model}}$ to an inner dimension $d_{ff}$ (in the original model $d_{ff}=2048$ when $d_{\text{model}}=512$), and the second linear layer projects it back down to $d_{\text{model}}$[20]. In practice, this is equivalent to two one-dimensional convolutions with kernel size 1. Each layer of the Transformer uses its own feed-forward network parameters (they vary from layer to layer)[21], and applying this non-linear transformation at every position helps the model further process and mix the information extracted by the attention mechanisms.

中文:除了注意力子层之外,编码器和解码器的每一层还包含一个逐位置前馈神经网络(FFN),用于对每个位置的表示进行进一步变换[17]。这个前馈网络对序列中每个位置独立且相同地应用(因此称为“逐位置”),即它不在不同的序列位置之间混合信息。该网络由两个线性变换和中间一个ReLU激活函数组成[18]。形式上,对于给定位置的输入向量$x$,前馈子层计算:$FFN(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$[19]。第一个线性层将向量维度从$d_{\text{model}}$扩大到内部维度$d_{ff}$(原始模型中,当$d_{\text{model}}=512$时取$d_{ff}=2048$),第二个线性层再将其投影回$d_{\text{model}}$[20]。在实现上,这相当于两个核大小为1的一维卷积。Transformer的每一层都有各自独立的前馈网络参数(不同层间不共享)[21]。在每个位置应用这种非线性变换,有助于模型进一步处理由注意力机制提取的信息并进行非线性组合。
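The FFN formula above can be rendered directly in a few lines of numpy. This is a minimal sketch with the base-model dimensions ($d_{\text{model}}=512$, $d_{ff}=2048$); the random weight matrices stand in for the learned parameters $W_1, b_1, W_2, b_2$ and are for illustration only.

```python
import numpy as np

# Sketch of the position-wise FFN: FFN(x) = max(0, x W1 + b1) W2 + b2,
# applied identically at every position. Random weights stand in for the
# learned parameters; dimensions follow the base model.
rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10        # n = sequence length

W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # ReLU between the expansion (512 -> 2048) and the projection (2048 -> 512)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(n, d_model))
y = ffn(x)
assert y.shape == (n, d_model)

# "Position-wise": perturbing position 0 leaves every other position's output unchanged.
x2 = x.copy(); x2[0] += 1.0
assert np.allclose(ffn(x2)[1:], y[1:])
```

The final assertion demonstrates the point made in the text: the FFN never mixes information across sequence positions, only across feature dimensions.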

Residual Connections and Layer Normalization (残差连接与层归一化)

English: The Transformer extensively uses residual connections and layer normalization to facilitate training of the deep architecture[3]. Formally, for a given sub-layer (either a multi-head attention or the feed-forward network), the output of the sub-layer is added to the original sub-layer input (this sum is the residual connection), and then layer normalization is applied to this sum[3]. This Add & Norm operation can be written as: $\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$, where $x$ is the input to the sub-layer and $\text{Sublayer}(x)$ is the function computed by the sub-layer[3]. The residual connection (first introduced by He et al. 2016[22]) helps mitigate the vanishing gradient problem and allows gradients to flow through the network more directly by providing an alternate path for the signal. Meanwhile, Layer Normalization (Ba et al. 2016) normalizes the summed output of each layer to have stable mean and variance, which accelerates convergence and stabilizes training in deep networks. In the Transformer, every sub-layer (attention or feed-forward) in the encoder and decoder is wrapped with a residual addition and a subsequent layer norm; furthermore, all sub-layers and embedding layers produce outputs of dimension $d_{\text{model}}=512$ to ensure the residual sums are dimensionally compatible[23][24]. During training, dropout (with rate 0.1 in the base model) is applied to the output of each sub-layer before it is added to the residual input and normalized, as a regularization technique[25].

中文:Transformer广泛使用了残差连接和层归一化来辅助深层网络的训练[3]。形式上,对于任一子层(无论是多头注意力或前馈网络),先将该子层的输出与该子层的原始输入相加(该和即为残差连接),然后对相加结果进行层归一化[3]。这个Add & Norm操作可表示为:$\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$,其中$x$是子层输入,$\text{Sublayer}(x)$是子层的输出函数[3]。残差连接(He等人2016年首次引入[22])通过为信号提供一条旁路,使梯度能够更直接地传播,缓解了深层网络中的梯度消失问题。层归一化(Ba等人2016年提出)则对每层的输出进行归一化处理,使其均值和方差稳定,有助于加速收敛并提升深层网络训练的稳定性。在Transformer中,编码器和解码器的每一个子层(注意力或前馈)都通过残差加法和后续的层归一化来封装;此外,为确保残差加法在维度上的匹配,所有子层以及嵌入层的输出维度均保持为$d_{\text{model}}=512$[23][24]。在训练过程中,Transformer还对每个子层输出在加到残差之前进行dropout随机失活(基本模型中失活率为0.1),然后再进行残差连接和归一化,以起到正则化的作用[25]。
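The Add & Norm wrapper can be sketched directly from the formula $\text{LayerNorm}(x + \text{Sublayer}(x))$. This is a minimal illustration: the sublayer is an arbitrary stand-in function, and the learned gain/bias parameters of LayerNorm are omitted (fixed at 1 and 0) for brevity.

```python
import numpy as np

# Minimal sketch of "Add & Norm": LayerNorm(x + Sublayer(x)).
# LayerNorm's learned gain/bias are omitted here for brevity.
def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)   # normalize each position across features

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))  # residual addition, then normalization

x = np.random.default_rng(1).normal(size=(4, 8))
out = add_and_norm(x, lambda t: 0.5 * t)  # toy stand-in sublayer

# Each position's output has ~zero mean and ~unit variance across features.
assert np.allclose(out.mean(axis=-1), 0, atol=1e-6)
```

Note the residual path: even if the toy sublayer returned zeros, `x` itself would pass straight through to the normalization, which is the alternate signal path the text describes.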

2. Experiments and Results (实验设置与结果)

Data and Training Setup (数据集与训练设置)

English: The authors evaluated the Transformer on two major machine translation tasks: WMT 2014 English-to-German and WMT 2014 English-to-French. For English-German, they used the standard WMT14 dataset with about 4.5 million sentence pairs, applying byte-pair encoding (BPE) to obtain a shared source-target vocabulary of around 37,000 tokens[26]. For English-French, a much larger dataset of 36 million sentence pairs was used, with words segmented into a 32,000-token vocabulary using a word-piece model[26]. Sentence pairs were batched by approximate length, and each training batch contained roughly 25k source tokens and 25k target tokens[27]. The Transformer was implemented in two model sizes: a base model and a big model. The Transformer (base) configuration had 6 layers, $d_{\text{model}}=512$, $d_{ff}=2048$, 8 heads ($d_k=d_v=64$), and dropout rate 0.1[28]. The Transformer (big) configuration had 6 layers, $d_{\text{model}}=1024$, $d_{ff}=4096$, 16 heads ($d_k=d_v=64$), and a higher dropout rate of 0.3 by default[29]. The base model has about 65 million parameters, whereas the big model is about 3× larger (approximately 213 million parameters)[29]. Models were trained on 8 NVIDIA P100 GPUs; the base model trained for 100,000 steps (~12 hours), with each step processing a batch in about 0.4 seconds[30]. The big model trained for 300,000 steps (~3.5 days) at about 1.0 second per step[31]. Optimization used Adam (β₁=0.9, β₂=0.98) with a custom learning rate schedule: the learning rate increases linearly for 4,000 warm-up steps and then decays proportionally to the inverse square root of the step number[32]. In practice, this schedule is given by $lr = d_{\text{model}}^{-0.5}\cdot \min(\text{step}^{-0.5},\, \text{step}\cdot \text{warmup}^{-1.5})$, which initially ramps up and then scales down[33]. They also employed label smoothing of 0.1 during training, which makes the model less confident in its predictions (slightly hurting perplexity but improving accuracy and BLEU score)[34].
Dropout was applied in several places (to each sub-layer's output as described above, and also to the sums of the embeddings and positional encodings). For inference, they used beam search decoding (beam size 4) with a length penalty $\alpha=0.6$[35]; these hyperparameters were chosen after experiments on the development set.

中文:作者在两个主要的机器翻译任务上评估了Transformer模型:WMT 2014英→德和WMT 2014英→法翻译。对于英德任务,使用了标准的WMT14数据集(约450万对句子),并采用字节对编码(BPE)来获得约37,000个词片的源-目标共享词汇表[26]。对于英法任务,使用了更大的包含3600万句对的数据集,并通过词片模型将单词划分,得到大小约32,000的词汇表[26]。在训练中,将长度相近的句对放入同一批次,每个批次大约包含25k个源语言token和25k个目标语言token[27]。Transformer采用了两种模型规模:基本模型(base)和大模型(big)。Transformer (base)配置为6层、$d_{\text{model}}=512$、$d_{ff}=2048$、8个注意力头(每头维度$d_k=d_v=64$),Dropout率0.1[28]。Transformer (big)配置为6层、$d_{\text{model}}=1024$、$d_{ff}=4096$、16个注意力头(每头仍为$d_k=d_v=64$),默认采用较高的Dropout率0.3[29]。基本模型约有6,500万参数,而大模型的参数量约为其3倍(约2.13亿参数)[29]。模型在8块NVIDIA P100 GPU上进行训练;基本模型训练100,000步(约12小时),每步处理一个批次约需0.4秒[30];大模型训练300,000步(约3.5天),每步约1.0秒[31]。优化器采用Adam(β₁=0.9,β₂=0.98),配合自定义的学习率调度策略:前4000步线性升高学习率,此后按照步数的负0.5次方比例下降[32]。具体公式为$lr = d_{\text{model}}^{-0.5}\cdot \min(\text{step}^{-0.5},\, \text{step}\cdot \text{warmup}^{-1.5})$,即先随步数升高后按$\text{step}^{-0.5}$衰减[33]。训练中还使用了标签平滑(系数0.1),这使模型对预测输出不过于自信(虽然会略微损害困惑度,但提高了准确率和BLEU分)[34]。此外,在模型的各个部分应用了Dropout正则化(如前述的子层输出以及词嵌入和位置编码的和),推断阶段使用了集束搜索解码(集束宽度为4)并加上长度惩罚$\alpha=0.6$[35](这些超参数是在开发集上调优得到的)。
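The warm-up schedule above is easy to verify numerically. The sketch below implements the stated formula with the base-model values ($d_{\text{model}}=512$, warmup = 4000) and checks that the rate rises until the warm-up boundary and decays afterwards.

```python
# Sketch of the paper's learning-rate schedule:
# lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup^{-1.5}).
def lr(step, d_model=512, warmup=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Linear ramp-up for the first 4000 steps, then inverse-square-root decay:
assert lr(1) < lr(100) < lr(4000)
assert lr(4000) > lr(8000) > lr(100000)

# At the warm-up boundary the two branches of min() coincide.
assert abs(lr(4000) - 512 ** -0.5 * 4000 ** -0.5) < 1e-12
```

One detail worth noticing: the whole schedule scales with $d_{\text{model}}^{-0.5}$, so the big model ($d_{\text{model}}=1024$) automatically trains with a proportionally smaller learning rate under the same formula.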

Baselines and Comparison Methods (对比方法及模型)

English: The Transformer models were compared against the previous state-of-the-art sequence models based on recurrent or convolutional architectures. Notable baselines included Google's GNMT (Google Neural Machine Translation), an LSTM-based encoder-decoder with attention (with reinforcement learning fine-tuning)[36], and Facebook's ConvS2S (Convolutional Seq2Seq) model, which uses convolutional layers in place of recurrence[37]. The authors also compared against the ByteNet and Deep-Att + PosUnk models for context[38], as well as a Mixture of Experts (MoE) model, another contemporary approach[39]. Some of these baselines were ensembles (using multiple models for inference) to achieve higher accuracy[40]; for example, the ConvS2S ensemble and the GNMT+RL ensemble were among the top performers before the Transformer. These baselines had BLEU scores in the mid-20s on English-German (e.g. ~24.6 for GNMT and ~25.2 for ConvS2S as single models, up to ~26.3 for their ensembles)[36]. On English-French, strong baselines achieved BLEU scores around 40 (e.g. ~40.5 for the ConvS2S and MoE single models, and ~41.3 for the best ensemble)[36][41]. The Transformer (base and big) was evaluated against these to demonstrate the benefits of the new architecture.

中文:Transformer模型与先前基于循环或卷积架构的序列模型进行了对比。重要的基线包括谷歌的GNMT(Google神经机器翻译系统,基于LSTM的编码器-解码器架构并结合注意力机制,后续通过强化学习微调)[36]以及Facebook的ConvS2S(卷积序列到序列)模型,它以卷积层替代循环网络来建模序列[37]。作者还比较了ByteNet模型、“Deep-Att + PosUnk”模型等其它方法[38],以及一种混合专家模型(MoE),这是当时的另一种前沿尝试[39]。其中一些对比方法使用了模型集成(ensemble)的方式提升性能[40],例如ConvS2S集成模型和GNMT+RL集成模型在Transformer提出前属于性能最佳的系统。这些基线模型在英→德翻译上的BLEU成绩大多在20分中段(例如GNMT单模型约24.6,ConvS2S单模型约25.2;其集成模型可达约26.3)[36]。在英→法任务上,一些强势基线的BLEU达到了接近40分(例如ConvS2S或MoE单模型约40.5,最佳集成模型约41.3)[36][41]。Transformer(基本模型和大模型)与这些方法进行了对比评测,以证明新架构在性能上的优势。

Main Results and Analysis (主要结果与分析)

English: On the WMT14 English-German test set, the Transformer established a new state of the art: the big Transformer achieved 28.4 BLEU, outperforming the best previous results by over 2 BLEU[42]. Notably, this surpassed even ensembles of earlier models; for example, the previous best ensemble (ConvS2S) scored around 26.4 BLEU[36][42], so a single Transformer model exceeded it. The base Transformer model (at 27.3 BLEU) already outperformed all prior single models and even those ensembles, despite using a fraction of the training cost[42]. On the WMT14 English-French task, the big Transformer reached 41.0 BLEU, the highest single-model score at the time[43]. It slightly exceeded the best previous single model (around 40.5 BLEU) and was competitive with or better than previous ensembles[44][45]. What's more, the training cost to reach these results was dramatically lower: for English-French, the Transformer's 41.0 BLEU used less than 1/4 of the training cost (in FLOPs) of the previous state-of-the-art model[43]. In absolute terms, training the big Transformer took 3.5 days on 8 GPUs, whereas previous neural translation models often took significantly longer or required larger-scale data to reach similar performance[46]. The authors also report that checkpoint averaging (averaging the last few model checkpoints) and beam search with a length penalty modestly improved final translation quality[47]. Overall, the results demonstrated the Transformer's superior quality, training efficiency, and scalability: it achieved higher translation accuracy than RNN/CNN-based models, and did so with far less training time and resource usage[2][45]. These findings validated the paper's premise that attention-centric models can be both effective and efficient for sequence transduction.

中文:在WMT14英→德测试集上,Transformer模型取得了新的最先进性能:大型Transformer模型达到了28.4 BLEU,比之前最好的结果高出超过2个BLEU点[42]。值得注意的是,这一成绩甚至超越了早期模型的集成结果——例如,此前最好的集成模型(ConvS2S)约为26.4 BLEU[36][42],而单个Transformer模型的性能就已超过了它。Transformer的基本模型也取得了27.3 BLEU,已经优于之前所有的单模型和集成模型,并且其训练开销只是之前模型的一小部分[42]。在WMT14英→法任务上,大型Transformer模型达到了41.0 BLEU,这是当时单模型的最高分[43]。该结果略微超过了此前最好的单模型(约40.5 BLEU),并且可以媲美甚至优于之前的集成模型表现[44][45]。更为可贵的是,Transformer达到这些结果所需的训练成本大幅降低:对于英→法任务,Transformer的大模型取得41.0 BLEU所消耗的训练计算量不到之前最优模型的1/4[43]。具体来说,训练Transformer大型模型在8块GPU上耗时3.5天,而以往的神经翻译模型往往需要显著更长的时间或更大的数据规模才能达到相近的性能[46]。作者还指出,通过检查点平均(对最后几次保存的模型取平均)以及在解码时使用带长度惩罚的集束搜索,能够进一步小幅提升翻译质量[47]。总体而言,实验结果展示了Transformer的卓越翻译质量、训练效率和可扩展性:它在翻译准确度上超越了基于RNN/CNN的模型,并且用更少的训练时间和资源达到了这一性能[2][45]。这些结果验证了论文的核心观点:以注意力机制为中心的模型在序列转换任务中可以同时兼具高效性与有效性。

3. Advantages over RNN/CNN Models (相对于传统RNN/CNN模型的优势)

English: The Transformer's architecture confers several key advantages over traditional recurrent (RNN/LSTM) and convolution-based sequence models. Training efficiency and parallelism is a primary advantage: because the self-attention mechanism allows the model to consider all positions of a sequence at once (in a single layer) rather than sequentially, the Transformer can fully leverage parallel computation on GPUs[48]. In each self-attention layer, all token positions are processed simultaneously with matrix operations, requiring only $O(1)$ sequential steps (just one) for the entire layer[49]. In contrast, an RNN must process tokens one by one, taking $O(n)$ sequential operations for a sequence of length $n$[49]. This means training a Transformer can be much faster, especially for long sequences, since operations can be parallelized and the model better fits modern hardware like GPUs/TPUs, which thrive on parallelism[50][49]. Another advantage is the ability to model long-range dependencies effectively. In an RNN, the dependency between positions far apart in the sequence has to travel through many time steps (creating a long path through the network), potentially causing information to dilute or gradients to vanish. In the Transformer, any two positions in a sequence can interact via self-attention in one step, making the path length between long-range dependencies dramatically shorter (constant $O(1)$ per layer)[49]. This short path length makes it easier for the model to learn relationships between distant words[51][49]. Empirically, the Transformer showed better handling of long sentences and captured global context more effectively than LSTM or CNN models. Additionally, in terms of computational complexity per layer, self-attention has complexity $O(n^2 \cdot d)$ (where $n$ is the sequence length and $d$ the model dimension), whereas a recurrent layer is $O(n \cdot d^2)$[52][49].
For typical sentence lengths $n$ smaller than the model dimension $d$, self-attention layers are actually faster to compute than RNN layers[49]. Convolutional models (like ConvS2S) can be parallelized and have $O(1)$ sequential steps as well, but they typically use fixed-size convolution kernels, which give them a limited effective context per layer (CNNs require more layers or larger kernels to cover long-range dependencies)[53][54]. The Transformer's self-attention, by comparison, is global (unrestricted in context within a layer) and thus more flexible in capturing dependencies at all distances. In summary, compared to earlier RNN/CNN-based approaches, the Transformer offers faster training through parallelism, better scaling to long sequences, and often higher model capacity for a given computational cost[2][49].

中文:Transformer架构相对于传统的循环网络(RNN/LSTM)和卷积网络模型有若干重要优势。首先在训练效率和并行性方面:由于自注意力机制允许模型在单层中一次性考虑序列的所有位置(而非按顺序逐步处理),Transformer能够充分利用GPU上的并行计算能力[48]。每一层自注意力可以同时处理序列中所有token位置,通过矩阵运算完成,对整个序列而言只需$O(1)$的顺序操作(单步完成)[49];相比之下,RNN对长度为$n$的序列需要$O(n)$的顺序处理步骤[49]。这意味着Transformer在训练长序列时速度更快,因为它的运算可以并行展开,非常契合现代GPU/TPU等硬件擅长并行计算的特性[50][49]。另一个优势是建模长距离依赖的能力更强。在RNN中,序列中相距很远的两个位置之间的依赖需要经过多次时间步传递(在网络中形成较长的路径),信息可能逐渐衰减,梯度也可能消失。而在Transformer中,序列中任意两个位置可以通过自注意力在一次计算中直接建立关联,因而长距离依赖之间的路径长度被显著缩短(每层仅为常数$O(1)$)[49]。更短的路径使模型更容易学习远距离单词之间的关系[51][49]。从经验上看,Transformer比LSTM或CNN模型更善于处理长句子,并能更有效地捕捉全局上下文。此外,就每层的计算复杂度而言,自注意力层为$O(n^2 \cdot d)$(其中$n$为序列长度,$d$为模型维度),而循环层约为$O(n \cdot d^2)$[52][49]。对于典型的句子长度$n$远小于模型维度$d$的情况,自注意力层实际上比循环层更高效[49]。卷积模型(如ConvS2S)在并行性上也很强,每层只需$O(1)$的顺序步骤,但通常使用固定大小的卷积核,使其每层覆盖的上下文范围有限(例如CNN需要增加层数或使用更大卷积核才能覆盖较长距离的依赖)[53][54]。相比之下,Transformer的自注意力在单层内就是全局的(不限制注意范围),因而在捕捉各个距离上的依赖关系方面更灵活。总而言之,与早期基于RNN/CNN的方法相比,Transformer提供了更快的训练速度(通过完全并行化)和对长序列更好的扩展能力,并且在相同计算成本下往往具有更高的模型表示能力[2][49]。
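The per-layer complexity comparison above reduces to simple arithmetic, which the sketch below checks for a few sequence lengths with $d = d_{\text{model}} = 512$ (operation counts are order-of-magnitude stand-ins, not exact FLOP counts).

```python
# Per-layer operation counts (cf. Table 1 of the paper): self-attention is
# O(n^2 * d), a recurrent layer is O(n * d^2). With d = d_model = 512,
# attention does less work per layer exactly when n < d.
d = 512

def attn_ops(n):  # self-attention: every position scores every other position
    return n * n * d

def rnn_ops(n):   # recurrence: one d x d state update per position
    return n * d * d

assert attn_ops(50) < rnn_ops(50)       # typical sentence (n=50): attention is cheaper
assert attn_ops(512) == rnn_ops(512)    # break-even exactly at n = d
assert attn_ops(2000) > rnn_ops(2000)   # very long sequence: recurrence cheaper per layer
```

This is the crossover the text describes: for ordinary sentence lengths the quadratic-in-$n$ term is harmless, and the sequential-step advantage ($O(1)$ vs $O(n)$) dominates in practice.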

4. Impact on Subsequent Models: BERT, GPT, T5 (对后续模型的启发与影响)

English: The Transformer architecture has profoundly influenced almost all modern NLP models. Subsequent breakthrough models like BERT, GPT, and T5 are all built upon the Transformer's attention-based design, with extensions or modifications suited to their specific goals:

- BERT (2018) keeps only the Transformer encoder stack and pre-trains it with masked language modeling (plus next-sentence prediction), producing deep bidirectional representations for understanding tasks.
- GPT (2018 onward) keeps only the Transformer decoder stack with masked (causal) self-attention, pre-trained as an autoregressive language model and scaled up dramatically in later versions.
- T5 (2019) retains the full encoder-decoder Transformer and casts every NLP task as text-to-text, demonstrating the architecture's versatility under one unified framework.

中文:Transformer架构对现代NLP模型产生了深远影响。后续出现的诸多突破性模型,如BERT、GPT、T5等,都是在Transformer的注意力机制架构基础上扩展或改进而来:

- BERT(2018)仅保留Transformer的编码器堆叠,通过掩码语言模型(以及下一句预测)进行预训练,得到面向理解任务的深层双向表示。
- GPT(2018年起)仅保留Transformer的解码器堆叠(带因果掩码的自注意力),以自回归语言建模方式预训练,并在后续版本中不断扩大规模。
- T5(2019)保留完整的编码器-解码器结构,将所有NLP任务统一转化为文本到文本的形式,展示了该架构在统一框架下的通用性。

5. Mathematical Formulations and Derivations (数学公式与推导解释)

Scaled Dot-Product Attention (缩放点积注意力)

English: The fundamental building block of the Transformer's attention mechanism is Scaled Dot-Product Attention. In this operation, we have three matrices: the query $Q \in \mathbb{R}^{n_q \times d_k}$, key $K \in \mathbb{R}^{n_k \times d_k}$, and value $V \in \mathbb{R}^{n_k \times d_v}$, where $n_q$ is the number of query vectors (the target sequence length for decoder self-attention, or the same as $n_k$ for self-attention), $n_k$ is the number of key/value vectors (the source sequence length for encoder-decoder attention, or the same sequence length for self-attention), and $d_k$, $d_v$ are their vector dimensions. The attention score matrix is computed by taking the dot product of each query with each key: $QK^T$ yields an $n_q \times n_k$ matrix of raw scores indicating how well each query aligns with each key[59]. These scores are then scaled by $\frac{1}{\sqrt{d_k}}$; this prevents the dot products from growing too large in magnitude as $d_k$ increases, which could push the softmax into regions of extremely small gradients[60]. After scaling, a softmax is applied to each row (for each query) to produce a probability distribution over the $n_k$ keys. This gives the attention weights for the values. Finally, these weights are used to compute a weighted sum of the value vectors $V$. The complete formula for the output is:

$$ \text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V, $$

which results in an $n_q \times d_v$ matrix as the attended output[61]. In practice, $Q$, $K$, and $V$ are often represented as $[\text{batch} \times \text{sequence} \times \text{dim}]$ tensors, and the computation is done in parallel for all queries in the batch. The scaling by $\sqrt{d_k}$ is crucial: as noted by the authors, without scaling, the dot products grow with $d_k$ and the softmax can saturate, while with scaling, the variance of the input to the softmax remains more stable[62]. The softmax ensures that each query's attention weights over all keys sum to 1, forming a convex combination of values. This scaled dot-product attention is the core operation used in all attention layers of the Transformer (in self-attention as well as encoder-decoder attention).

中文:Transformer注意力机制的基础单元是缩放点积注意力。在该操作中,我们处理三个矩阵:查询矩阵 $Q \in \mathbb{R}^{n_q \times d_k}$、键矩阵 $K \in \mathbb{R}^{n_k \times d_k}$ 和值矩阵 $V \in \mathbb{R}^{n_k \times d_v}$。其中 $n_q$ 是查询向量的数量(对于解码器自注意力通常是目标序列长度,或在自注意力情况下与$n_k$相同),$n_k$ 是键/值向量的数量(对于编码器-解码器注意力通常是源序列长度,或在自注意力情况下与序列长度相同),$d_k$和$d_v$分别是键(查询)向量和值向量的维度。首先通过 $QK^T$ 计算注意力分数矩阵,即计算每个查询向量与每个键向量的点积,得到一个 $n_q \times n_k$ 的矩阵,表示每个查询和每个键的匹配程度[59]。然后将该分数除以$\sqrt{d_k}$进行缩放——这样做是为了解决随着$d_k$增大点积值变大的问题,避免softmax输入值过大导致梯度极小[60]。缩放之后,对每个查询对应的分数行应用softmax,得到针对每个键的注意力权重(概率分布)。最后,用这些权重对值矩阵$V$的各值向量进行加权求和,得到输出。完整的公式为:

$$ \text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V, $$

其输出是一个 $n_q \times d_v$ 的矩阵,即加权后的值的集合[61]。在实现中,$Q$、$K$、$V$通常表示为形如[批大小 $\times$ 序列长度 $\times$ 向量维度]的张量,上述计算会对一批中所有查询并行完成。$\frac{1}{\sqrt{d_k}}$的缩放非常关键——正如作者指出,如果不进行缩放,点积会随着$d_k$增大而增大,导致softmax可能进入梯度饱和区域;采用缩放可以使进入softmax的值的方差保持稳定[62]。Softmax保证了每个查询对所有键的注意力权重之和为1,从而对值向量形成凸组合。这一缩放点积注意力就是Transformer所有注意力层(无论自注意力还是编码器-解码器注意力)的核心运算。
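The formula translates almost line-for-line into numpy. The sketch below is a direct, unbatched rendering (a real implementation would operate on batched tensors and add masking), with shapes matching the text: $Q$ is $(n_q, d_k)$, $K$ is $(n_k, d_k)$, $V$ is $(n_k, d_v)$.

```python
import numpy as np

# Direct rendering of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) raw alignment scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: rows sum to 1
    return weights @ V                              # (n_q, d_v) convex combinations of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))   # 3 queries of dimension d_k = 64
K = rng.normal(size=(5, 64))   # 5 keys
V = rng.normal(size=(5, 8))    # 5 values of dimension d_v = 8
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (3, 8)     # n_q x d_v, as in the text
```

With a single key/value pair the softmax weight is 1 and the output is exactly that value vector, which makes the convex-combination interpretation easy to verify.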

Multi-Head Attention – Matrix Form (多头注意力的矩阵形式)

English: Building on the single-head attention defined above, the Transformer uses Multi-Head Attention to allow the model to attend to multiple subspaces of the input simultaneously. In matrix form, multi-head attention can be expressed compactly. For each of the $h$ heads, we have separate learned projection matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ that project the input vectors into head $i$'s query, key, and value subspaces[63]. We also have an output projection $W^O \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}$ for combining the heads' outputs[63]. Denoting the input (for example, the sequence of vectors from the previous layer) as $X$ (dimension $n \times d_{\text{model}}$ for $n$ positions), the multi-head attention is:

$$ \begin{aligned} \text{head}_i &= \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i=1,\dots,h, \\ \text{MultiHead}(X) &= \text{Concat}(\text{head}_1,\dots,\text{head}_h)\, W^O. \end{aligned} $$

In words, each head $i$ takes the input $X$ and computes its own queries $Q_i = X W_i^Q$, keys $K_i = X W_i^K$, and values $V_i = X W_i^V$[9]. Then it performs scaled dot-product attention to produce $\text{head}_i$ (an $n \times d_v$ matrix). The $h$ head results are concatenated along the feature dimension (resulting in an $n \times (h \cdot d_v)$ matrix) and then multiplied by $W^O$ (of size $h d_v \times d_{\text{model}}$) to bring the dimension back to $d_{\text{model}}$[9]. This yields the final $n \times d_{\text{model}}$ output of the multi-head attention sub-layer. Importantly, the projections $W_i^Q, W_i^K, W_i^V$ and $W^O$ are learned parameters. By using multiple heads, the model effectively has $h$ different learned transformations of the input, and each head's scaled dot-product attention can focus on different aspects (patterns or positions) of the sequence[8][10]. The concatenation and final linear layer $W^O$ then mix these diverse attentions together. The paper's default hyperparameters set $h=8$ heads, with $d_k = d_v = 64$ (so $h \cdot d_v = 512 = d_{\text{model}}$). This configuration keeps the computation cost roughly the same as a single head (since 8 heads of dimension 64 have the same total vector size as one head of 512)[11]. Multi-head attention thus provides the Transformer with the ability to look at the input from several representation subspaces at once, greatly enriching its modeling capacity beyond what a single attention head could do.

中文:在上述单头注意力的基础上,Transformer使用多头注意力使模型能够同时关注输入的多个子空间。用矩阵形式可以简洁地表示多头注意力机制。对于$h$个注意力头,每个头都有各自独立学习的投影矩阵:$W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$、$W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$、$W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$,用于将输入向量投影到第$i$个头的查询、键、值子空间[63]。另外还有一个输出投影矩阵 $W^O \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}$,用于将多头的输出合并映射回模型维度[63]。记输入(例如前一层输出的序列向量)为 $X$,其尺寸为 $n \times d_{\text{model}}$($n$是序列长度),则多头注意力的表达为:

$$ \begin{aligned} \text{head}_i &= \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i=1,\dots,h, \\ \text{MultiHead}(X) &= \text{Concat}(\text{head}_1,\dots,\text{head}_h)\, W^O. \end{aligned} $$

用语言描述,即每个头$i$将输入$X$分别左乘自己的投影矩阵计算出查询矩阵$Q_i = X W_i^Q$、键矩阵$K_i = X W_i^K$和值矩阵$V_i = X W_i^V$[9]。然后,对这些$Q_i, K_i, V_i$执行缩放点积注意力运算,得到该头的输出$\text{head}_i$(一个 $n \times d_v$ 矩阵)。接着,将$h$个头的输出在特征维度上拼接(得到 $n \times (h \cdot d_v)$ 的矩阵),再乘以输出矩阵 $W^O$(尺寸为 $h d_v \times d_{\text{model}}$),将维度投影回$d_{\text{model}}$[9]。结果就是多头注意力子层的最终输出,维度为 $n \times d_{\text{model}}$。需要强调的是,$W_i^Q, W_i^K, W_i^V$以及$W^O$都是训练中学得的参数。通过使用多个注意力头,模型相当于对输入进行了$h$种不同的线性变换,每个头的缩放点积注意力可以关注序列的不同方面(不同的模式或位置)[8][10]。最后的拼接及线性变换$W^O$再将这些多样的注意力信息进行整合。论文中的默认超参数是$h=8$头,每个头使用$d_k = d_v = 64$(因此在基本模型中$h \cdot d_v = 512 = d_{\text{model}}$)。这种配置保证总的向量维度与单头注意力相同(因为8个维度64的头的总维度与1个维度512的头相等),计算成本也近似不变[11]。多头注意力使Transformer能够同时从多个表示子空间考察输入,大大增强了模型的表达能力,超出了单一注意力头所能达到的效果。
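The matrix form can be sketched in a few lines of numpy. Random matrices stand in for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$; dimensions follow the base model ($h=8$, $d_{\text{model}}=512$, $d_k=d_v=64$), and the per-head loop is written out explicitly for readability rather than efficiency.

```python
import numpy as np

# Sketch of MultiHead(X) = Concat(head_1..head_h) W^O with base-model shapes.
rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
d_k = d_v = d_model // h                      # 64 per head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-ins for the learned projections, one set per head.
WQ = rng.normal(0, 0.02, (h, d_model, d_k))
WK = rng.normal(0, 0.02, (h, d_model, d_k))
WV = rng.normal(0, 0.02, (h, d_model, d_v))
WO = rng.normal(0, 0.02, (h * d_v, d_model))  # output projection

def multi_head(X):
    heads = []
    for i in range(h):                        # each head attends in its own subspace
        Qi, Ki, Vi = X @ WQ[i], X @ WK[i], X @ WV[i]
        Ai = softmax(Qi @ Ki.T / np.sqrt(d_k))   # (n, n) attention weights
        heads.append(Ai @ Vi)                    # (n, d_v) per-head output
    # concat to (n, h*d_v) = (n, 512), then project back to d_model
    return np.concatenate(heads, axis=-1) @ WO

X = rng.normal(size=(n, d_model))
assert multi_head(X).shape == (n, d_model)
```

The shape assertion at the end mirrors the dimensional bookkeeping in the text: eight 64-dimensional heads concatenate back to 512 before the $W^O$ projection.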

Positional Encoding – Formula Construction (位置编码公式构造)

English: The Positional Encoding in the Transformer provides a deterministic way to encode token positions with sinusoidal functions. The formula specified in the paper for the positional encoding of position $pos$ at embedding dimension index $i$ is:

$$ \begin{aligned} PE(pos, 2i) &= \sin\!\left( \frac{pos}{10000^{\,2i/d_{\text{model}}}} \right), \\ PE(pos, 2i+1) &= \cos\!\left( \frac{pos}{10000^{\,2i/d_{\text{model}}}} \right), \end{aligned} $$

for $0 \le 2i < d_{\text{model}}$[13]. This means that each even-indexed dimension of the positional encoding vector is given by a sine and each odd-indexed dimension by a cosine, with the wavelength of these sinusoids increasing as the dimension index $i$ increases. To elaborate, consider the term $10000^{\,2i/d_{\text{model}}}$: for $i=0$, this term is $10000^0 = 1$, so the sine/cosine has a period of $2\pi$ (since $\sin(pos/1)$ has period $2\pi$ in $pos$). For $i = 1$ (the second even dimension, index 2 in 0-based counting), the term is $10000^{2/d_{\text{model}}}$, so the frequency is slightly lower (the period is slightly longer). For the largest $i$ (near $d_{\text{model}}/2$), the term $10000^{2i/d_{\text{model}}}$ approaches $10000^{2(d_{\text{model}}/2)/d_{\text{model}}} = 10000^1 = 10000$, so the slowest-varying sinusoid has a period of $2\pi \cdot 10000$[14]. In effect, this scheme produces a set of basis sinusoids ranging from high frequency to low frequency. Any particular position $pos$ has a unique combination of sine and cosine values across the dimensions, which the model can use to infer absolute position or relative positions. An important property mentioned in the paper is that these positional encodings allow the model to easily learn to attend by relative position. For a fixed offset $k$, the positional encoding of $(pos+k)$ can be expressed as a linear function of the encoding of $pos$[64]. This is because $\sin((pos+k)/M)$ and $\cos((pos+k)/M)$ can be expanded via the angle-addition formulas in terms of $\sin(pos/M)$ and $\cos(pos/M)$ multiplied by constants (which depend on $k$). As a result, the self-attention layers could potentially learn to generalize to sequence lengths longer than those seen in training by exploiting this linear relationship[15].
The authors also tried learned positional embeddings (treating position as a trainable vector) and found it performed similarly[16], but they chose the sinusoidal formula as it might offer better generalization to longer sequences and it adds no additional learned parameters[65].
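As a concrete illustration, the formula above can be sketched in a few lines of plain Python. This is a minimal re-implementation of the published formula, not the authors' code; the function name is ours:

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal positional encoding for one position, per the paper's formula.

    Even dimensions 2i hold sin(pos / 10000^(2i/d_model));
    odd dimensions 2i+1 hold the cosine at the same frequency.
    """
    pe = [0.0] * d_model
    for dim in range(0, d_model, 2):              # dim plays the role of 2i
        freq = 1.0 / (10000 ** (dim / d_model))   # 10000^(2i/d_model) in the denominator
        pe[dim] = math.sin(pos * freq)            # even index: sine
        if dim + 1 < d_model:
            pe[dim + 1] = math.cos(pos * freq)    # odd index: cosine
    return pe

# At pos = 0 every sine term is 0 and every cosine term is 1, so the vector
# alternates 0, 1, 0, 1, ... -- a quick sanity check of the even/odd layout.
print(positional_encoding(0, 8))
```

Note that dimension 0 oscillates with period exactly $2\pi$ in $pos$, while the last pair varies with period near $2\pi \cdot 10000$, matching the frequency-progression discussion above.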

中文:Transformer中的位置编码使用正弦函数为每个位置提供确定性的表示。论文中给出的针对位置$pos$在位置编码向量第$i$维的公式为:

$$ \begin{aligned} PE(pos, 2i) &= \sin\!\Big( \frac{pos}{10000^{\,2i/d_{\text{model}}}} \Big), \\ PE(pos, 2i+1) &= \cos\!\Big( \frac{pos}{10000^{\,2i/d_{\text{model}}}} \Big), \end{aligned} $$

其中$0 \le 2i < d_{\text{model}}$[13]。也就是说,位置编码向量中索引为偶数的维度由正弦函数给出,索引为奇数的维度由余弦函数给出,而且随着维度索引$i$的增加,这些正弦/余弦的波长(周期)按比例增长。具体而言,考虑项$10000^{\,2i/d_{\text{model}}}$:当$i=0$时,该项为$10000^0 = 1$,因此此维度采用$\sin(pos/1)$(周期为$2\pi$);当$i=1$(即按0起索引的维度2和3)时,该项为$10000^{2/d_{\text{model}}} > 1$,频率更低(周期更长);而对于最大的$i$(即$2i$接近$d_{\text{model}}$时),$10000^{2i/d_{\text{model}}}$趋近于$10000^{2(d_{\text{model}}/2)/d_{\text{model}}} = 10000^1 = 10000$,对应的正弦函数变化最慢(周期为$2\pi \cdot 10000$)[14]。实际上,这样的设计产生了一组从高频到低频的基正弦波。任何给定的位置$pos$都会在各个维度上得到一组唯一的正弦和余弦值组合,模型可以利用这些值来推断绝对位置或相对位置。论文中特别指出,这种位置编码允许模型轻松学习按相对位置来关注:对于固定的偏移$k$,位置$(pos+k)$的编码可以表示为位置$pos$编码的线性函数[64]。这是因为$\sin((pos+k)/M)$和$\cos((pos+k)/M)$可以通过和角公式表示为$\sin(pos/M)$、$\cos(pos/M)$的线性组合(系数仅取决于$k$)。因此,自注意力层有可能利用这种线性关系泛化到训练时未见过的更长序列[15]。作者还尝试了将位置作为可训练向量的可学习位置嵌入,发现效果与正弦位置编码几乎相同[16],但最终选择正弦公式,一方面因为它可能更好地泛化到更长序列,另一方面它不引入任何额外的可学习参数[65],实现简单优雅。
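The relative-position property can also be checked numerically: for a fixed offset $k$, each (sin, cos) pair of $PE(pos+k)$ is a 2×2 rotation of the corresponding pair of $PE(pos)$, with coefficients that depend only on $k$ and the dimension index, not on $pos$. A small sketch of this identity (helper names are ours, not from the paper):

```python
import math

def pe_pair(pos: float, i: int, d_model: int) -> tuple[float, float]:
    """The (sin, cos) pair occupying dimensions (2i, 2i+1) of PE(pos)."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift_pair(pair: tuple[float, float], k: float, i: int, d_model: int) -> tuple[float, float]:
    """Map the pair for position pos to the pair for pos + k.

    Uses the angle-addition formulas, so the 2x2 matrix applied here
    depends only on the offset k and the dimension index i, not on pos.
    """
    s, c = pair
    w = k / (10000 ** (2 * i / d_model))         # rotation angle for this frequency
    return (s * math.cos(w) + c * math.sin(w),   # sin(x + w) = sin x cos w + cos x sin w
            c * math.cos(w) - s * math.sin(w))   # cos(x + w) = cos x cos w - sin x sin w
```

For any `pos`, `k`, and `i`, `shift_pair(pe_pair(pos, i, d), k, i, d)` reproduces `pe_pair(pos + k, i, d)` up to floating-point error, which is exactly the linear relationship cited from the paper.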

References: Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017 [2][42]. Other sources are cited inline throughout this analysis.


[1] [2] [3] [4] [5] [6] [7] [8] [9] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [49] [51] [52] [53] [54] [60] [62] [63] [64] [65] Attention Is All You Need

https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf

[10] [48] [50] [59] Attention Is All You Need - Wikipedia

https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

[55] [56] [57] [58] Transformer Models Compared: BERT vs GPT vs T5 Guide

https://www.devstree.com/transformer-models-use-case-guide-bert-gpt-t5/

[61] I Finally Understood “Attention is All You Need” After So Long. Here’s How I Did It. | by Olubusolami Sogunle | Artificial Intelligence in Plain English

https://ai.plainenglish.io/i-finally-understood-attention-is-all-you-need-after-so-long-heres-how-i-did-it-263b46273f9f?gi=42d9b414278a
