1.4 The Pre-training and Fine-tuning Paradigm: Transformer Pre-training Strategies and Fine-tuning Methods

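Section guide: After this section you will understand the pre-training paradigm and fine-tuning techniques for Transformer models, know the training strategies behind models such as BERT and GPT, and be able to work through a complete pre-training and fine-tuning workflow.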

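Learning objectives

Understand how pre-training tasks and objective functions are designed for Transformer models
Learn strategies for building pre-training datasets and controlling their quality
Learn the main technical routes for fine-tuning and the metrics used to evaluate them
Work through a complete code implementation of the pre-training and fine-tuning workflow
Understand where different fine-tuning methods apply and how their results compare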

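Core concepts

Pre-training followed by fine-tuning is the standard training paradigm for modern large language models. In the pre-training stage, a base model is trained on large-scale unlabeled data and learns general language ability. In the fine-tuning stage, the model is adjusted on task-specific data so that it fits the needs of a downstream application.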

Environment setup / prerequisites


# Install the required dependencies (the leading "!" is Jupyter notebook syntax)
!pip install transformers datasets torch numpy matplotlib seaborn

# Import the required libraries
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
import matplotlib.pyplot as plt

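Step-by-step walkthrough

Step 1: Designing the pre-training task

Pre-training objectives for Transformer models fall into two main families: masked language modeling (MLM), used by BERT, and autoregressive language modeling (abbreviated ALM in this section and more commonly called causal language modeling), used by GPT.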
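To make the difference between the two objectives concrete, the sketch below builds training targets for a toy token sequence in plain Python. The helper names `causal_lm_pairs` and `mlm_example` are illustrative and not part of any library, and the token ids are made up. The 15% selection rate and the 80/10/10 split follow the convention popularized by BERT.

```python
import random

IGNORE = -100  # label value that the loss skips, by convention


def causal_lm_pairs(token_ids):
    """Autoregressive objective: predict token t+1 from the tokens up to t."""
    return token_ids[:-1], token_ids[1:]


def mlm_example(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Masked objective: corrupt some tokens and predict the originals."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected: no loss is computed here
        labels[i] = tok  # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the input token unchanged
    return inputs, labels


tokens = [11, 12, 13, 14, 15, 16]
print(causal_lm_pairs(tokens))
print(mlm_example(tokens, mask_id=0, vocab_size=100, rng=random.Random(0)))
```

In the autoregressive case every position contributes to the loss, whereas in the masked case only the selected positions do, which is why MLM labels are mostly the ignore value.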


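class PretrainingTaskConfig:
    """Configuration for the pre-training task."""
    
    def __init__(self, task_type="mlm", mask_prob=0.15, 
                 max_seq_length=512, vocab_size=30522):
        self.task_type = task_type  # "mlm" or "alm"
        self.mask_prob = mask_prob  # fraction of tokens selected for prediction
        self.max_seq_length = max_seq_length
        self.vocab_size = vocab_size
    
    def get_masked_tokens(self, input_ids):
        """Select positions for MLM and split them 80/10/10.

        Returns three boolean tensors shaped like input_ids: all selected
        positions, those to replace with [MASK], and those to replace with
        a random token. Selected positions in neither subset keep their token.
        """
        # Select roughly mask_prob (15% by default) of all positions
        mask_positions = torch.rand(input_ids.shape) < self.mask_prob
        
        # A second, independent draw splits the selected positions
        split = torch.rand(input_ids.shape)
        
        # 80% of the selected positions: replace with [MASK]
        actual_mask = mask_positions & (split < 0.8)
        
        # 10% of the selected positions: replace with a random token
        random_replace = mask_positions & (split >= 0.8) & (split < 0.9)
        
        # The remaining 10% stay unchanged but are still predicted
        return mask_positions, actual_mask, random_replace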

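def prepare_pretraining_data(texts, tokenizer, config):
    """Tokenize texts and build pre-training examples."""
    dataset = []
    
    for text in texts:
        # Tokenize, pad to a fixed length, and flag special tokens
        tokens = tokenizer(text, 
                          truncation=True,
                          max_length=config.max_seq_length,
                          padding='max_length',
                          return_special_tokens_mask=True,
                          return_tensors='pt')
        
        input_ids = tokens['input_ids'].squeeze(0)
        attention_mask = tokens['attention_mask'].squeeze(0)
        # [CLS], [SEP] and [PAD] must never be masked or predicted
        maskable = tokens['special_tokens_mask'].squeeze(0) == 0
        
        if config.task_type == "mlm":
            mask_positions, actual_mask, random_replace = config.get_masked_tokens(input_ids)
            mask_positions = mask_positions & maskable
            actual_mask = actual_mask & maskable
            random_replace = random_replace & maskable
            
            # Labels keep the original token at selected positions;
            # every other position is set to -100 so the loss ignores it
            labels = input_ids.clone()
            labels[~mask_positions] = -100
            
            # Corrupt the inputs: [MASK] or a random vocabulary token
            input_ids = input_ids.clone()
            input_ids[actual_mask] = tokenizer.mask_token_id
            input_ids[random_replace] = torch.randint(
                0, config.vocab_size, (int(random_replace.sum()),))
        else:
            # Autoregressive LM: the targets are the input tokens, minus padding
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100
            
        dataset.append({
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        })
    
    return dataset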

# Example usage
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
config = PretrainingTaskConfig(task_type="mlm")

# Sample texts
sample_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformers have revolutionized natural language processing.",
    "Pre-training on large datasets enables better generalization."
]

pretraining_data = prepare_pretraining_data(sample_texts, tokenizer, config)
print(f"Prepared pre-training examples: {len(pretraining_data)} samples")

Step 2: Configuring the pre-training model


class PretrainingModel(nn.Module):
    """Encoder plus a task-specific head for pre-training."""
    
    def __init__(self, model_name, config):
        super().__init__()
        # Loads existing weights, so this continues pre-training
        # rather than starting from a random initialization
        self.base_model = AutoModel.from_pretrained(model_name)
        self.config = config
        
        # Add a head that matches the task type
        if config.task_type == "mlm":
            # MLM head: projects hidden states to vocabulary logits
            # (this linear layer is newly initialized, not pre-trained)
            self.mlm_head = nn.Linear(
                self.base_model.config.hidden_size, 
                self.base_model.config.vocab_size
            )
        
    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.base_model(input_ids=input_ids, 
                                 attention_mask=attention_mask)
        
        hidden_states = outputs.last_hidden_state
        
        if self.config.task_type == "mlm":
            # Logits of shape (batch, seq_len, vocab_size)
            predictions = self.mlm_head(hidden_states)
            
            # Positions labeled -100 are skipped by the loss
            loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
            
            mlm_loss = None
            if labels is not None:
                # Flatten to (batch * seq_len, vocab) and (batch * seq_len,)
                # for the loss only, so the returned logits keep their shape
                mlm_loss = loss_fct(
                    predictions.view(-1, predictions.size(-1)),
                    labels.view(-1)
                )
            
            return {
                'loss': mlm_loss,
                'logits': predictions,
                'hidden_states': hidden_states
            }
        
        return outputs

# Initialize the pre-training model
pretraining_model = PretrainingModel('bert-base-uncased', config)
print(f"Pre-training model parameters: {sum(p.numel() for p in pretraining_model.parameters()):,}")
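The role of the -100 label can be checked by hand. The following sketch uses only the standard library and computes a mean cross-entropy over the positions whose label is not -100. `masked_cross_entropy` is an illustrative helper, not a PyTorch function. It is meant to mirror the idea of a cross-entropy loss with an ignore index, not to replace one.

```python
import math

IGNORE = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target indices, or IGNORE for positions to skip
    """
    losses = []
    for scores, target in zip(logits, labels):
        if target == IGNORE:
            continue
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_z - scores[target])
    if not losses:
        return None  # nothing to predict in this batch
    return sum(losses) / len(losses)


logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2], [0.0, 3.0, 0.0]]
labels = [0, IGNORE, 1]
print(masked_cross_entropy(logits, labels))
```

A uniform prediction over V vocabulary entries gives a loss of log V, about 10.3 for a vocabulary of 30,522 entries. An MLM loss near that value at the start of training therefore suggests the head is still effectively uninitialized.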

Step 3: Implementing fine-tuning


class FineTuningConfig:
    """Configuration for fine-tuning."""
    
    def __init__(self, task_type="classification", 
                 num_classes=2, dropout_rate=0.1):
        self.task_type = task_type  # "classification", "regression", "generation"
        self.num_classes = num_classes
        self.dropout_rate = dropout_rate

class FineTuningModel(nn.Module):
    """Pre-trained encoder with a task-specific head for fine-tuning."""
    
    def __init__(self, base_model_name, config):
        super().__init__()
        self.base_model = AutoModel.from_pretrained(base_model_name)
        self.config = config
        
        # Freeze every parameter of the base model first
        for param in self.base_model.parameters():
            param.requires_grad = False
        
        # Then unfreeze only the top two encoder layers
        for param in self.base_model.encoder.layer[-2:].parameters():
            param.requires_grad = True
        
        # Add the task-specific head
        if config.task_type == "classification":
            self.dropout = nn.Dropout(config.dropout_rate)
            self.classifier = nn.Linear(
                self.base_model.config.hidden_size, 
                config.num_classes
            )
    
    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.base_model(input_ids=input_ids, 
                                 attention_mask=attention_mask)
        
        # Use the hidden state of the first token ([CLS]) as the sentence vector
        cls_output = outputs.last_hidden_state[:, 0, :]
        
        if self.config.task_type == "classification":
            cls_output = self.dropout(cls_output)
            logits = self.classifier(cls_output)
            
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels) if labels is not None else None
            
            return {
                'loss': loss,
                'logits': logits,
                'hidden_states': outputs.last_hidden_state
            }
        
        return outputs

def prepare_finetuning_data(texts, labels, tokenizer, max_length=512):
    """Tokenize labeled texts and build fine-tuning examples."""
    dataset = []
    
    for text, label in zip(texts, labels):
        tokens = tokenizer(text, 
                          truncation=True,
                          max_length=max_length,
                          padding='max_length',
                          return_tensors='pt')
        
        dataset.append({
            # squeeze(0) drops only the batch dimension added by return_tensors
            'input_ids': tokens['input_ids'].squeeze(0),
            'attention_mask': tokens['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        })
    
    return dataset

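# Example: a sentiment analysis task
finetuning_texts = [
    "I love this movie! It's amazing!",
    "This product is terrible. I hate it.",
    "The service was excellent and professional.",
    "Worst experience ever. Would not recommend."
]

finetuning_labels = [1, 0, 1, 0]  # 1: positive, 0: negative
finetuning_config = FineTuningConfig(task_type="classification", num_classes=2)

# Prepare the fine-tuning data
finetuning_dataset = prepare_finetuning_data(
    finetuning_texts, finetuning_labels, 
    tokenizer, max_length=128
)

# Initialize the fine-tuning model
finetuning_model = FineTuningModel('bert-base-uncased', finetuning_config)
print(f"Fine-tuning model parameters (total): {sum(p.numel() for p in finetuning_model.parameters()):,}")
# Only the unfrozen layers and the classifier head are updated during training
trainable = sum(p.numel() for p in finetuning_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")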

Complete example: the pre-training and fine-tuning workflow


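class CompletePipeline:
    """Skeleton of the pre-training and fine-tuning pipeline.

    Both stages prepare data and build a model; the optimization loop
    is not implemented here, so the `epochs` arguments are unused.
    """
    
    def __init__(self, base_model_name="bert-base-uncased"):
        self.base_model_name = base_model_name
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.pretraining_config = PretrainingTaskConfig()
        self.finetuning_config = FineTuningConfig()
    
    def run_pretraining_pipeline(self, corpus_texts, epochs=2):
        """Prepare the data and model for the pre-training stage."""
        print("=== Starting the pre-training stage ===")
        
        # Prepare the pre-training data
        pretraining_data = prepare_pretraining_data(
            corpus_texts, self.tokenizer, self.pretraining_config
        )
        
        # Initialize the model
        pretraining_model = PretrainingModel(
            self.base_model_name, self.pretraining_config
        )
        
        print(f"Pre-training samples: {len(pretraining_data)}")
        print("Pre-training setup complete (no training steps are run in this skeleton)")
        return pretraining_model
    
    def run_finetuning_pipeline(self, task_texts, task_labels, epochs=2):
        """Prepare the data and model for the fine-tuning stage."""
        print("=== Starting the fine-tuning stage ===")
        
        # Prepare the fine-tuning data
        finetuning_data = prepare_finetuning_data(
            task_texts, task_labels, 
            self.tokenizer, max_length=128
        )
        
        # Initialize the fine-tuning model
        finetuning_model = FineTuningModel(
            self.base_model_name, self.finetuning_config
        )
        
        print(f"Fine-tuning samples: {len(finetuning_data)}")
        print("Fine-tuning setup complete (no training steps are run in this skeleton)")
        return finetuning_model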

# Run the complete workflow
pipeline = CompletePipeline()

# Simulated pre-training corpus
pretraining_corpus = [
    "Natural language processing is a subfield of artificial intelligence.",
    "Machine learning algorithms can learn patterns from data.",
    "Deep neural networks have revolutionized AI research."
] * 50

# Simulated fine-tuning task
task_texts = [
    "This is a positive review.",
    "I'm not happy with this product.",
    "Great service and quality product.",
    "Poor customer experience."
]

task_labels = [1, 0, 1, 0]

# Run both stages of the pipeline
pretrained_model = pipeline.run_pretraining_pipeline(pretraining_corpus, epochs=1)
fine_tuned_model = pipeline.run_finetuning_pipeline(task_texts, task_labels, epochs=1)

print("Pre-training and fine-tuning workflow finished")
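The `CompletePipeline` class prepares data and builds models but does not contain an optimization loop. A minimal sketch of such a loop follows. The `train` function, the `TinyClassifier` stand-in model, and the hyperparameters are all assumptions made for illustration. The stand-in replaces `FineTuningModel` only so that the example runs on random toy data without downloading weights; the loop itself expects nothing more than examples with `input_ids`, `attention_mask`, and `labels`, and a model that returns a dict with a `loss` entry.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train(model, dataset, epochs=1, lr=2e-5, batch_size=2):
    """Plain training loop; returns the mean loss of each epoch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # Optimize only the parameters that are not frozen
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    model.train()
    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch)['loss']
            loss.backward()
            optimizer.step()
            total += loss.item()
        epoch_losses.append(total / len(loader))
    return epoch_losses


class TinyClassifier(nn.Module):
    """Hypothetical stand-in model: mean-pooled embeddings plus a linear head."""

    def __init__(self, vocab_size=50, hidden=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask, labels=None):
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.embed(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        logits = self.classifier(pooled)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return {'loss': loss, 'logits': logits}


torch.manual_seed(0)
toy_data = [
    {'input_ids': torch.randint(0, 50, (8,)),
     'attention_mask': torch.ones(8, dtype=torch.long),
     'labels': torch.tensor(i % 2)}
    for i in range(8)
]
losses = train(TinyClassifier(), toy_data, epochs=3, lr=1e-2)
print(losses)
```

Because the loop only touches parameters with `requires_grad=True`, the same function would respect the layer freezing done in `FineTuningModel`.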

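FAQ

Q1: Why pre-train at all, and why not use the pre-trained model as is?

A: Pre-training learns general language knowledge from large-scale unlabeled data. A pre-trained model can be used directly for simple tasks, but specialized domains and complex tasks need fine-tuning, because it 1) adapts the model to domain terminology; 2) teaches task-specific patterns; 3) mitigates domain shift; and 4) improves downstream task performance.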

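Q2: Which layers should be frozen during fine-tuning?

A: The strategy depends on task complexity and data volume: 1) simple tasks or small datasets: freeze most layers and train only the classification head; 2) medium complexity: freeze the lower layers and train the top ones; 3) complex tasks or large datasets: train all layers, but with different learning rates; 4) domain adaptation: freeze the base layers and train the upper ones. For a 12-layer encoder such as bert-base, freezing the first 6 to 8 layers is a common starting point, though the best split should be confirmed on a validation set.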
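Option 3, training every layer with a different learning rate, is often implemented as layer-wise learning rate decay. The sketch below shows only the arithmetic. The function name, the decay factor of 0.9, and the base rate of 2e-5 are assumptions chosen for illustration, not values taken from this section.

```python
def layerwise_learning_rates(num_layers, base_lr=2e-5, decay=0.9):
    """Return one learning rate per encoder layer, bottom layer first.

    The top layer trains at base_lr; each layer below it is scaled by
    another factor of `decay`, so lower layers change more slowly.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


rates = layerwise_learning_rates(num_layers=12)
for index, lr in enumerate(rates):
    print(f"layer {index:2d}: lr = {lr:.2e}")
```

In PyTorch, values like these would typically be passed to the optimizer as separate parameter groups, each with its own `lr`.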

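Q3: How can overfitting be prevented?

A: Ways to prevent overfitting during fine-tuning include: 1) data augmentation (synonym replacement, back-translation, and so on); 2) dropout regularization; 3) weight decay; 4) early stopping; 5) cross-validation; 6) learning rate scheduling; 7) training for fewer epochs.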
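Of the methods listed, early stopping is the easiest to show in a few lines. The class below is a minimal sketch. Its name, the `patience` and `min_delta` parameters, and the validation losses in the usage example are all hypothetical.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]  # made-up validation losses
for epoch, loss in enumerate(history, start=1):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}, best validation loss {stopper.best}")
        break
```

In practice the checkpoint saved at the best validation loss, not the final one, is the model to keep.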

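Best practices and pitfalls

Practice 1: Explore the data thoroughly before pre-training to make sure it is both clean and diverse
Practice 2: Use learning rate warmup to avoid instability early in training
Practice 3: Monitor training and validation loss, and adjust hyperparameters promptly
Practice 4: Use a smaller learning rate for fine-tuning (typically 1/10 to 1/5 of the pre-training rate)
Practice 5: Save multiple checkpoints to support later analysis and model selection
Pitfall 1: Neglecting data preprocessing, which leads to unstable training or poor results
Pitfall 2: Setting the learning rate too high, which causes exploding gradients or oscillation
Pitfall 3: Using a validation set that is too small or unevenly distributed, so model performance cannot be assessed accurately
Pitfall 4: Fine-tuning for too long, so the model overfits the training data and generalizes worse
Pitfall 5: Ignoring memory management, which leads to GPU out-of-memory errors or slow training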
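Practice 2 can be illustrated with a small schedule function. This is a sketch of linear warmup followed by linear decay, one common choice among several. The function name and the default values (`peak_lr=2e-5`, `warmup_steps=100`) are assumptions made for illustration.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

The `transformers` library provides a comparable scheduler, `get_linear_schedule_with_warmup`, which applies the same shape to an optimizer's learning rate.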

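Section summary

This section covered the pre-training and fine-tuning paradigm for Transformer models: pre-training task design, data preparation, model configuration, the training workflow, and fine-tuning methods. With this material, readers should be able to follow the full training paradigm and choose pre-training and fine-tuning strategies that suit a given task.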

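Further reading

Keywords: pre-training, fine-tuning, BERT, GPT, masked language modeling, autoregressive language modeling, training strategies
Difficulty: Advanced
Estimated reading time: 40 minutes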