2.1 Encoder Architecture in Depth

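2.1 Encoder Architecture in Depth: Core Components of the Transformer Encoder

Section guide: this section examines the architecture of the Transformer encoder and explains how its core components (Multi-Head Attention, the Feed Forward Network, and Layer Normalization) work and how to implement them.

Learning Objectives

Understand where the encoder sits in a Transformer model and what it does
Master the mathematics and implementation of the Multi-Head Attention mechanism
Understand the design and operation of the Feed Forward Network
Understand the role of supporting components such as Layer Normalization and Dropout
Be able to implement a complete encoder module independently and optimize its performance

Core Concepts

The Transformer encoder is a stack of identical encoder layers. Each layer contains two core sublayers, a Multi-Head Self-Attention mechanism and a Feed Forward Network, and each sublayer is wrapped in a residual connection followed by Layer Normalization.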
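
As a compact summary of the layer structure described above, one encoder layer can be written as follows. This is a sketch in the post-normalization form that the code in this section uses; the symbols $x$, $h$, and $y$ (layer input, intermediate state, layer output) are notation introduced here for illustration.

```latex
\begin{aligned}
h &= \mathrm{LayerNorm}\bigl(x + \mathrm{Dropout}(\mathrm{MultiHeadSelfAttention}(x))\bigr) \\
y &= \mathrm{LayerNorm}\bigl(h + \mathrm{Dropout}(\mathrm{FFN}(h))\bigr)
\end{aligned}
```

Stacking this map once per layer, on top of the token embeddings plus positional encoding, gives the full encoder.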

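Environment Setup / Prerequisites


# Install the required dependencies ("!" is notebook syntax; omit it in a shell)
!pip install torch transformers numpy matplotlib seaborn

# Import the required libraries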
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Tuple
import math

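Step-by-Step Walkthrough

Step 1: Overall Encoder Architecture


class EncoderConfig:
    """Encoder configuration."""
    
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, 
                 d_ff=2048, n_layers=6, max_seq_len=512, 
                 dropout=0.1, layer_norm_eps=1e-12):
        # The per-head dimension below is only exact when d_model divides evenly
        if d_model % n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        
        self.vocab_size = vocab_size
        self.d_model = d_model          # model (hidden) dimension
        self.n_heads = n_heads          # number of attention heads
        self.d_ff = d_ff                # feed-forward inner dimension
        self.n_layers = n_layers        # number of encoder layers
        self.max_seq_len = max_seq_len  # maximum sequence length
        self.dropout = dropout
        self.layer_norm_eps = layer_norm_eps
        
        # Dimension of each attention head
        self.d_k = d_model // n_heads
        self.d_v = d_model // n_heads

class Encoder(nn.Module):
    """Transformer encoder."""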
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.config = config
        
        # Embedding layers
        self.embedding = nn.Embedding(config.vocab_size, config.d_model)
        self.positional_encoding = PositionalEncoding(config.d_model, config.max_seq_len, config.dropout)
        
        # Stack of encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(config) for _ in range(config.n_layers)
        ])
        
        # Final layer normalization
        self.layer_norm = nn.LayerNorm(config.d_model, eps=config.layer_norm_eps)
        
    def forward(self, input_ids: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        Forward pass.
        
        Args:
            input_ids: [batch_size, seq_len] input token IDs
            attention_mask: [batch_size, seq_len] padding mask (1 = real token, 0 = padding)
        
        Returns:
            output: [batch_size, seq_len, d_model] encoder output
            attention_weights: list of [batch_size, n_heads, seq_len, seq_len] attention weights, one per layer
        """
        # Token embeddings
        embeddings = self.embedding(input_ids)  # [batch_size, seq_len, d_model]
        
        # Add positional encoding
        embeddings = self.positional_encoding(embeddings)
        
        # Pass through the encoder layers
        output = embeddings
        attention_weights = []
        
        for layer in self.layers:
            output, attn_weights = layer(output, attention_mask)
            attention_weights.append(attn_weights)
        
        # Final layer normalization
        output = self.layer_norm(output)
        
        return output, attention_weights

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding."""
    
    def __init__(self, d_model: int, max_len: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Build the positional encoding matrix
        position = torch.arange(max_len).unsqueeze(1)  # [max_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2) * 
                           -(math.log(10000.0) / d_model))
        
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as a buffer: saved with the module but not trained
        self.register_buffer('pe', pe)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch_size, seq_len, d_model]
        
        Returns:
            x: [batch_size, seq_len, d_model] tensor with positional encoding added
        """
        x = x + self.pe[:x.size(1)]
        return self.dropout(x)

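# Smoke-test the encoder
# Note: Encoder depends on EncoderLayer (Step 4) and its sublayers (Steps 2-3).
# Define those classes before building and running `encoder` below.
config = EncoderConfig(d_model=512, n_heads=8, n_layers=6)
encoder = Encoder(config)

print(f"Encoder parameter count: {sum(p.numel() for p in encoder.parameters()):,}")

# Test the forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len)  # 1 = real token, 0 = padding

output, attn_weights = encoder(input_ids, attention_mask)
print(f"Output shape: {output.shape}")
print(f"Number of attention-weight tensors: {len(attn_weights)}")
print(f"Shape of each attention-weight tensor: {attn_weights[0].shape}")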
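
The parameter count printed above can be cross-checked by hand. The following sketch recomputes it from the layer shapes used in this section, assuming the default bias terms of `nn.Linear` and the default affine parameters of `nn.LayerNorm`. The helper names `linear_params` and `encoder_param_count` are introduced here for illustration and are not part of the tutorial code.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return d_in * d_out + d_out


def encoder_param_count(vocab_size=30522, d_model=512, d_ff=2048, n_layers=6) -> int:
    """Expected trainable parameter count for the encoder layout in this section."""
    embedding = vocab_size * d_model                 # token embedding table
    attention = 4 * linear_params(d_model, d_model)  # wq, wk, wv, wo
    ffn = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)                  # two LayerNorms, each with weight and bias
    per_layer = attention + ffn + layer_norms
    final_norm = 2 * d_model                         # LayerNorm after the last layer
    return embedding + n_layers * per_layer + final_norm


print(f"{encoder_param_count():,}")  # 34,542,592
```

The sinusoidal positional encoding is registered as a buffer, so it adds no trainable parameters. The number of heads does not change the count either, because all heads share the same d_model x d_model projections.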

Step 2: Implementing Multi-Head Attention


class MultiHeadAttention(nn.Module):
    """Multi-head attention."""
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.config = config
        
        # Projection matrices for queries, keys, values, and the output
        self.wq = nn.Linear(config.d_model, config.d_model)
        self.wk = nn.Linear(config.d_model, config.d_model)
        self.wv = nn.Linear(config.d_model, config.d_model)
        self.wo = nn.Linear(config.d_model, config.d_model)
        
        # Dropout applied to the attention weights
        self.dropout = nn.Dropout(config.dropout)
        
        # Scaling factor sqrt(d_k)
        self.scale = math.sqrt(config.d_k)
    
    def forward(self, query: torch.Tensor, key: torch.Tensor, 
                value: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        Forward pass of multi-head attention.
        
        Args:
            query: [batch_size, seq_len_q, d_model] query vectors
            key: [batch_size, seq_len_k, d_model] key vectors
            value: [batch_size, seq_len_v, d_model] value vectors (seq_len_v == seq_len_k)
            attention_mask: [batch_size, seq_len_k] key padding mask (1 = attend, 0 = ignore)
        
        Returns:
            output: [batch_size, seq_len_q, d_model] attention output
            attention_weights: [batch_size, n_heads, seq_len_q, seq_len_k] attention weights
        """
        batch_size = query.size(0)
        
        # Linear projections
        Q = self.wq(query)  # [batch_size, seq_len_q, d_model]
        K = self.wk(key)    # [batch_size, seq_len_k, d_model]
        V = self.wv(value)  # [batch_size, seq_len_v, d_model]
        
        # Reshape to multi-head layout: [batch_size, n_heads, seq_len, d_k]
        Q = Q.view(batch_size, -1, self.config.n_heads, self.config.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.config.n_heads, self.config.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.config.n_heads, self.config.d_v).transpose(1, 2)
        
        # Scaled dot-product attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        # Apply the attention mask
        if attention_mask is not None:
            # Broadcast the key padding mask over heads and query positions
            mask = attention_mask.unsqueeze(1).unsqueeze(1)  # [batch_size, 1, 1, seq_len_k]
            # The dtype's most negative finite value also works in half precision
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
        
        # Softmax over keys gives the attention weights
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Weighted sum of the value vectors
        context = torch.matmul(attention_weights, V)
        
        # Merge heads back to [batch_size, seq_len_q, d_model]
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.config.d_model)
        
        # Final output projection
        output = self.wo(context)
        
        return output, attention_weights

class SelfAttention(nn.Module):
    """Self-attention (as used in the encoder): query, key, and value are the same tensor."""
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.multihead_attention = MultiHeadAttention(config)
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        Args:
            x: [batch_size, seq_len, d_model] input vectors
            attention_mask: [batch_size, seq_len] padding mask (1 = attend, 0 = ignore)
        
        Returns:
            output: [batch_size, seq_len, d_model] attention output
            attention_weights: [batch_size, n_heads, seq_len, seq_len] attention weights
        """
        return self.multihead_attention(x, x, x, attention_mask)

# Test multi-head attention
mha = MultiHeadAttention(config)
self_attn = SelfAttention(config)

batch_size, seq_len = 2, 10
x = torch.randn(batch_size, seq_len, config.d_model)
attention_mask = torch.ones(batch_size, seq_len)  # [batch_size, seq_len], 1 = attend

output, attn_weights = self_attn(x, attention_mask)
print(f"Self-attention output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
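
The test above uses an all-ones mask, so it never exercises padding. The sketch below isolates the masking step to show what happens when a sequence contains padded positions. It assumes the same convention as the code in this step (a `[batch_size, seq_len]` mask with 1 for real tokens and 0 for padding); the function name `masked_attention_weights` and the example batch are hypothetical.

```python
import math
import torch


def masked_attention_weights(q, k, key_padding_mask):
    """Scaled dot-product attention weights with a [batch, seq_len_k] padding mask.

    q, k: [batch, n_heads, seq_len, d_k]; key_padding_mask: 1 = real token, 0 = padding.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    mask = key_padding_mask[:, None, None, :]          # -> [batch, 1, 1, seq_len_k]
    scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1)


torch.manual_seed(0)
q = torch.randn(2, 4, 5, 16)
k = torch.randn(2, 4, 5, 16)
# Hypothetical batch: the second sequence has two padding positions at the end
key_padding_mask = torch.tensor([[1, 1, 1, 1, 1],
                                 [1, 1, 1, 0, 0]])

weights = masked_attention_weights(q, k, key_padding_mask)
print(weights.shape)                # torch.Size([2, 4, 5, 5])
print(weights[1, :, :, 3:].max())   # padded keys receive (numerically) zero weight
```

Every row of `weights` still sums to 1, so the probability mass that would have gone to padded keys is redistributed over the real tokens.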

Step 3: Implementing the Feed Forward Network


class FeedForward(nn.Module):
    """Feed-forward network."""
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.config = config
        
        # Two linear transformations
        self.linear1 = nn.Linear(config.d_model, config.d_ff)
        self.linear2 = nn.Linear(config.d_ff, config.d_model)
        
        # GELU activation
        self.gelu = nn.GELU()
        
        # Dropout
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch_size, seq_len, d_model] input vectors
        
        Returns:
            output: [batch_size, seq_len, d_model] FFN output
        """
        # First layer: d_model -> d_ff
        x = self.linear1(x)
        x = self.gelu(x)
        x = self.dropout(x)
        
        # Second layer: d_ff -> d_model
        x = self.linear2(x)
        # Note: EncoderLayer (Step 4) applies dropout to this output once more
        x = self.dropout(x)
        
        return x

class PositionWiseFeedForward(nn.Module):
    """Position-wise feed-forward network (as used in the encoder)."""
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.feed_forward = FeedForward(config)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.feed_forward(x)

# Test the feed-forward network
ffn = FeedForward(config)
pwffn = PositionWiseFeedForward(config)

x = torch.randn(batch_size, seq_len, config.d_model)
output = pwffn(x)
print(f"Feed-forward output shape: {output.shape}")

# Visualize the activation functions
plt.figure(figsize=(8, 4))

x_range = torch.linspace(-3, 3, 100)
gelu_values = torch.nn.GELU()(x_range).detach().numpy()
relu_values = torch.relu(x_range).numpy()

plt.subplot(1, 2, 1)
plt.plot(x_range.numpy(), gelu_values, label='GELU')
plt.plot(x_range.numpy(), relu_values, label='ReLU', linestyle='--')
plt.title('Activation Functions')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(x_range.numpy(), gelu_values)
plt.title('GELU Activation')
plt.grid(True)

plt.tight_layout()
plt.show()
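
The name "position-wise" means the same two-layer network is applied to each sequence position independently, with no mixing across positions. The sketch below checks this numerically. It assumes small illustrative sizes and leaves out dropout so the output is deterministic; it rebuilds the FFN structure with `nn.Sequential` rather than reusing the classes above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # small illustrative sizes, not the section's defaults

# Same structure as the FFN above, without dropout so the output is deterministic
ffn_sketch = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x_demo = torch.randn(2, 5, d_model)      # [batch_size, seq_len, d_model]
full = ffn_sketch(x_demo)                # all positions at once
# Apply the network to one position at a time, then stack along the sequence axis
per_position = torch.stack(
    [ffn_sketch(x_demo[:, t]) for t in range(x_demo.size(1))], dim=1
)

print(full.shape, per_position.shape)
print(torch.allclose(full, per_position, atol=1e-5))
```

Because positions do not interact inside the FFN, all cross-position information flow in an encoder layer comes from the attention sublayer.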

Step 4: Implementing the Complete Encoder Layer


class EncoderLayer(nn.Module):
    """Encoder layer."""
    
    def __init__(self, config: EncoderConfig):
        super().__init__()
        self.config = config
        
        # Self-attention sublayer
        self.self_attention = SelfAttention(config)
        
        # Feed-forward sublayer
        self.feed_forward = PositionWiseFeedForward(config)
        
        # Layer normalization, one per sublayer
        self.layer_norm1 = nn.LayerNorm(config.d_model, eps=config.layer_norm_eps)
        self.layer_norm2 = nn.LayerNorm(config.d_model, eps=config.layer_norm_eps)
        
        # Dropout on each sublayer output
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        Args:
            x: [batch_size, seq_len, d_model] input vectors
            attention_mask: [batch_size, seq_len] padding mask (1 = attend, 0 = ignore)
        
        Returns:
            output: [batch_size, seq_len, d_model] encoder layer output
            attention_weights: [batch_size, n_heads, seq_len, seq_len] attention weights
        """
        # Sublayer 1: self-attention + residual connection + layer normalization (post-LN)
        residual = x
        
        # Self-attention
        attn_output, attention_weights = self.self_attention(x, attention_mask)
        attn_output = self.dropout(attn_output)
        x = self.layer_norm1(residual + attn_output)
        
        # Sublayer 2: feed-forward network + residual connection + layer normalization
        residual = x
        
        # Feed-forward network
        ff_output = self.feed_forward(x)
        ff_output = self.dropout(ff_output)
        x = self.layer_norm2(residual + ff_output)
        
        return x, attention_weights

# Test the encoder layer
encoder_layer = EncoderLayer(config)
x = torch.randn(batch_size, seq_len, config.d_model)
output, attn_weights = encoder_layer(x, attention_mask)
print(f"Encoder layer output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

# Parameter statistics
def print_model_stats(model, name=""):
    """Print parameter statistics for a model."""
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"=== {name} ===")
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Trainable fraction: {trainable_params/total_params:.2%}")
    print()

print_model_stats(encoder, "Full encoder")
print_model_stats(encoder_layer, "Encoder layer")
print_model_stats(SelfAttention(config), "Self-attention module")
print_model_stats(FeedForward(config), "Feed-forward module")
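
The encoder layer above normalizes after the residual addition (post-LN). For comparison, here is a minimal sketch of a pre-LN layer, where normalization is applied to the sublayer input and the residual path is left untouched. It is not the tutorial's implementation: the class name `PreLNEncoderLayer` and its default sizes are hypothetical, and it uses PyTorch's built-in `nn.MultiheadAttention`, whose `key_padding_mask` uses the opposite convention to this section (True marks padding).

```python
import torch
import torch.nn as nn


class PreLNEncoderLayer(nn.Module):
    """Pre-LN encoder layer: LayerNorm is applied before each sublayer."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: [batch, seq_len] bool, True marks padding (PyTorch convention)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask, need_weights=False)
        x = x + self.dropout(attn_out)                  # residual path stays un-normalized
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x


layer = PreLNEncoderLayer().eval()
x_demo = torch.randn(2, 10, 64)
print(layer(x_demo).shape)  # torch.Size([2, 10, 64])
```

With pre-LN the output of the last layer is not normalized, so a final LayerNorm after the stack (as the `Encoder` in Step 1 already has) is usually kept.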

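FAQ

Q1: Why does the encoder use residual connections?

A: Residual connections mitigate the vanishing-gradient problem in deep networks by giving information a more direct path in both the forward and backward passes. This is what makes deeper networks trainable. In the encoder, each sublayer only has to learn a correction to its input, so the original input information is preserved while the layer learns a more complex transformation.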
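
The gradient argument can be made concrete with a short derivation. As a simplification, ignore Layer Normalization and write a residual sublayer as $y = x + F(x)$, where $F$ stands for either attention or the FFN (notation introduced here for illustration):

```latex
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
  = \frac{\partial \mathcal{L}}{\partial y}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term passes the upstream gradient through unchanged, so the signal reaching lower layers does not vanish even when $\partial F / \partial x$ is small.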

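Q2: Why does Multi-Head Attention need multiple heads?

A: Multiple heads let the model attend to information at different positions in the sequence at the same time, with each head learning its own representation pattern. This is similar in spirit to using multiple filters in a CNN. Different heads may capture different kinds of dependencies: some focus on local context, some on global context, and some on syntactic structure. This diversity increases the model's representational capacity.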
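
Mechanically, the heads are slices of the projected representation: each head works on its own d_k-dimensional chunk of every position. The sketch below reproduces the `view`/`transpose` split and merge used in Step 2 on a small tensor with illustrative sizes, so the layout can be inspected directly.

```python
import torch

batch_size, seq_len, d_model, n_heads = 2, 5, 16, 4  # small illustrative sizes
d_k = d_model // n_heads

x_demo = torch.arange(batch_size * seq_len * d_model, dtype=torch.float32)
x_demo = x_demo.view(batch_size, seq_len, d_model)

# Split: [batch, seq_len, d_model] -> [batch, n_heads, seq_len, d_k]
heads = x_demo.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)

# Head i sees features i*d_k ... (i+1)*d_k - 1 of every position
print(heads[0, 1, 0])  # tensor([4., 5., 6., 7.])

# Merge: inverse of the split; contiguous() is required before view()
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(torch.equal(merged, x_demo))  # True
```

The split and merge are lossless reshapes, so the extra expressiveness of multiple heads comes from running a separate softmax per head, not from additional parameters.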

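Q3: Why does the Feed Forward Network need two layers?

A: The first layer expands the dimension from d_model to d_ff (typically 4x), providing a larger space in which to learn nonlinear transformations. The second layer projects back to d_model so the output matches the rest of the architecture. The activation between the two layers is essential: without it, the two linear maps would collapse into a single linear map. This expand-then-contract design lets the FFN learn more complex functions while staying compatible with the residual stream.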
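
Part of the reason two layers help is the nonlinearity between them. As a small numerical sketch (illustrative sizes, double precision to keep rounding error out of the comparison), the code below shows that two stacked `nn.Linear` layers without an activation are equivalent to one linear map, and that inserting GELU breaks this equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # small illustrative sizes

lin1 = nn.Linear(d_model, d_ff).double()
lin2 = nn.Linear(d_ff, d_model).double()
x_demo = torch.randn(4, d_model, dtype=torch.float64)

# Without an activation, two stacked linear layers equal one linear layer
W = lin2.weight @ lin1.weight                 # [d_model, d_model]
b = lin2.weight @ lin1.bias + lin2.bias       # [d_model]
stacked = lin2(lin1(x_demo))
collapsed = x_demo @ W.T + b
print(torch.allclose(stacked, collapsed))     # True

# With GELU in between, the same single linear map no longer reproduces the output
with_gelu = lin2(nn.functional.gelu(lin1(x_demo)))
print(torch.allclose(with_gelu, collapsed))   # False
```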

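Q4: What is the difference between Layer Normalization and Batch Normalization?

A: Layer Normalization normalizes across all features of a single sample (in the encoder, of a single token position), whereas Batch Normalization normalizes each feature across all samples in the batch. Layer Normalization suits the encoder better because: 1) sequence lengths vary, which makes per-feature batch statistics unreliable; 2) each position carries different semantics; 3) its statistics do not depend on the batch size or on the other samples in the batch.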
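
The difference in normalization axis can be observed directly. In the sketch below (illustrative sizes, hypothetical variable names), the same sample is placed in two different batches: its LayerNorm output is unchanged, while its BatchNorm output in training mode depends on the other rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
layer_norm = nn.LayerNorm(d_model)
batch_norm = nn.BatchNorm1d(d_model)   # training mode: uses statistics of the current batch

sample = torch.randn(1, d_model)
batch_a = torch.cat([sample, torch.randn(3, d_model)])        # sample plus 3 others
batch_b = torch.cat([sample, 5 * torch.randn(3, d_model)])    # same sample, different others

# LayerNorm: each row is normalized over its own features
ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]
print(torch.allclose(ln_a, ln_b))   # True: output for `sample` ignores the rest of the batch
print(ln_a.mean().item())           # close to 0

# BatchNorm: each feature is normalized over the batch, so the result depends on batch mates
bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]
print(torch.allclose(bn_a, bn_b))   # False
```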

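Q5: Do more encoder layers always give better results?

A: Not necessarily. More layers increase model capacity, but they also bring problems: 1) vanishing or exploding gradients; 2) higher compute cost; 3) greater risk of overfitting; 4) harder training. In practice, choose the depth based on task complexity, data volume, and available compute. For reference, BERT-base uses 12 layers and BERT-large uses 24.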

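Best Practices and Pitfalls

Practice 1: Use pre-normalization (Pre-LN) to improve training stability
Practice 2: Choose the number of attention heads sensibly (the per-head dimension d_model / n_heads is typically 1/8 or 1/16 of the model dimension, e.g. 64 for d_model=512 with 8 heads)
Practice 3: Use gradient clipping to prevent exploding gradients
Practice 4: For long sequences, consider relative position encodings
Practice 5: For generation tasks, use a key/value cache at inference time (this applies to autoregressive decoders; an encoder processes the whole input in a single pass)
Pitfall 1: Setting the attention mask incorrectly, so the model cannot handle variable-length sequences properly
Pitfall 2: Forgetting to scale the attention scores, which makes gradients unstable
Pitfall 3: Putting layer normalization in the wrong place, which hurts convergence
Pitfall 4: Careless memory use, leading to GPU out-of-memory errors
Pitfall 5: Poor initialization, causing numerical instability early in training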
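
Pitfall 2 can be illustrated numerically. For queries and keys with unit-variance components, the dot product over d_k dimensions has a standard deviation of about sqrt(d_k), and softmax over such large scores saturates. The sketch below uses random tensors with an assumed d_k of 64 to compare scaled and unscaled scores; all variable names are illustrative.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10_000, d_k)   # unit-variance queries
k = torch.randn(10_000, d_k)   # unit-variance keys

raw = (q * k).sum(dim=-1)      # one dot product per row
scaled = raw / math.sqrt(d_k)
print(f"std of raw scores:    {raw.std().item():.2f}")     # roughly sqrt(d_k) = 8
print(f"std of scaled scores: {scaled.std().item():.2f}")  # roughly 1

# Softmax over large-magnitude scores tends toward one-hot
scores = torch.randn(8, d_k) @ torch.randn(d_k, 8)   # 8 queries x 8 keys, unscaled
peak_raw = torch.softmax(scores, dim=-1).max(dim=-1).values.mean().item()
peak_scaled = torch.softmax(scores / math.sqrt(d_k), dim=-1).max(dim=-1).values.mean().item()
print(f"mean max attention weight, unscaled: {peak_raw:.3f}")
print(f"mean max attention weight, scaled:   {peak_scaled:.3f}")
```

A near one-hot softmax has very small gradients with respect to most scores, which is the instability the pitfall refers to.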

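Section Summary

This section walked through the core architecture of the Transformer encoder and the implementation details of its key components: Multi-Head Attention, the Feed Forward Network, and Layer Normalization. With these building blocks, readers should be able to implement an encoder from scratch and reason about how to tune and optimize it.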

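Further Reading

Keywords: encoder, Multi-Head Attention, feed-forward network, layer normalization, residual connection, attention mechanism
Difficulty: Intermediate to advanced
Estimated reading time: 45 minutes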