2.2 Document Preprocessing: Text Cleaning and Splitting with Haystack


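Section guide: Master the core document preprocessing techniques in Haystack, including text cleaning, intelligent splitting, quality control, and batch processing, to build a high-quality text data pipeline for enterprise RAG systems.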

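Learning objectives

Master the cleaning strategies and parameter configuration of DocumentCleaner
Understand the characteristics and use cases of the different document splitting strategies
Learn to build a complete preprocessing pipeline for end-to-end text processing
Understand how to evaluate and tune preprocessing quality
Be able to handle complex document formats and special splitting requirements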

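Core concepts

Document preprocessing is a key bottleneck for RAG quality: it directly affects the accuracy of downstream retrieval and generation. Haystack's preprocessing components fall into two core categories: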

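🧹 Text cleaning components

DocumentCleaner: basic text cleaning; removes extra whitespace, repeated content, and headers/footers
DocumentPreprocessor: combined preprocessing that splits first and then cleans
Specialized cleaners: cleaners that target specific formats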

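✂️ Document splitting components

DocumentSplitter: basic splitter; splits by word, sentence, passage, or page
EmbeddingBasedDocumentSplitter: intelligent splitting based on semantic similarity
HierarchicalDocumentSplitter: hierarchical splitting that preserves document structure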

Environment setup / prerequisites


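# Install base dependencies (run in a shell)
pip install haystack-ai
pip install "haystack-ai[preprocessing]"  # advanced preprocessing features; verify that your haystack-ai release defines this extra
pip install sentence-transformers nltk pypdf  # used below by the embedder, sentence splitting, and PDF conversion
pip install chonkie  # advanced splitter (optional)
pip install hanlp    # Chinese text processing support (optional)

# Verify the installation (run in Python)
from haystack import Document
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter,
    DocumentPreprocessor
)
print("Preprocessing components loaded successfully")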

Step-by-step practice

Step 1: Basic text cleaning


from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Create a sample document
sample_text = """
    Hello World   
    
    This is a test document with extra whitespaces.
    
    
    Some repeated content repeated content repeated content.
    
    Header information that should be removed
    Main content here
    Footer information that should be removed
    
    Another line with extra   spaces
"""

document = Document(content=sample_text)
print("Original document length (characters):", len(document.content))

# Configure DocumentCleaner
cleaner = DocumentCleaner(
    remove_empty_lines=True,           # drop empty lines
    remove_extra_whitespaces=True,     # collapse runs of whitespace
    remove_substrings=["Header", "Footer"],  # remove these literal substrings (not the whole line)
    remove_regex=r'\b\d{4}-\d{2}-\d{2}\b'    # remove ISO-style dates such as 2024-01-31
)

# Run the cleaner
cleaned_docs = cleaner.run(documents=[document])
cleaned_text = cleaned_docs["documents"][0].content

print("Cleaned document length (characters):", len(cleaned_text))
print("Cleaning result:")
print("-" * 50)
print(cleaned_text)

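Output analysis:

Original document: contains many blank lines and extra spaces, plus repeated content
After cleaning: blank lines and extra whitespace are gone, and the literal strings "Header" and "Footer" are removed (the rest of each line stays). The date regex finds nothing to remove because the sample contains no dates, and the in-line repetition ("repeated content repeated content") is left untouched.
Result: the text is more compact, which makes downstream processing cheaper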
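
To make the two whitespace options concrete, here is a minimal plain-Python sketch of that kind of cleaning. The helper name `collapse_whitespace` is hypothetical, and the logic is an approximation for illustration only; it is not DocumentCleaner's actual implementation and may differ from it in how line breaks are treated.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Illustrative only: drop empty lines, then collapse runs of spaces/tabs."""
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]  # remove empty lines
    # Collapse any run of spaces or tabs inside a line to a single space
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in non_empty)

raw = "  Hello World   \n\n\nAnother line with extra   spaces\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

Comparing `len(raw)` with `len(cleaned)` on your own samples is a quick way to see how much of a document is pure layout noise.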

Step 2: Basic document splitting


from haystack.components.preprocessors import DocumentSplitter

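# Long sample document
long_text = """
Machine learning is a branch of artificial intelligence that enables computer systems to learn from data and improve their performance without being explicitly programmed.
Supervised learning is one of the most common machine learning methods; it uses labeled training data to learn the mapping between inputs and outputs.
Unsupervised learning works with unlabeled data and tries to discover the inherent structure and patterns in it.
Reinforcement learning uses trial and error to learn how an agent should act in an environment to maximize reward.
Deep learning uses neural networks to mimic the way the human brain learns and is especially well suited to complex nonlinear problems.
Natural language processing is an important application area of artificial intelligence that involves computers understanding, interpreting, and generating human language.
Computer vision enables machines to "see" and understand visual information such as images and video.
Speech recognition converts human speech into text, providing a natural way for people to interact with machines.
Knowledge graphs represent relationships between entities and support semantic search and reasoning.
Federated learning allows models to be trained without sharing raw data, which protects data privacy.
Edge computing moves computing resources closer to where data is produced, reducing latency and network bandwidth usage.
Quantum computing uses the principles of quantum mechanics to solve complex problems that classical computers struggle with.
Neural networks consist of interconnected neurons and can learn complex nonlinear mappings.
Transfer learning applies knowledge learned on one task to related tasks, improving learning efficiency.
Generative adversarial networks create realistic content through adversarial training of a generator and a discriminator.
Large language models are trained on massive amounts of text and can understand and generate human language.
At the core of reinforcement learning is the reward mechanism: the agent learns an optimal policy by maximizing cumulative reward.
Multimodal learning processes text, images, audio, and other data types together to provide a more complete understanding.
"""

long_document = Document(content=long_text)

# Configure the splitter
splitter = DocumentSplitter(
    split_by="sentence",      # split on sentence boundaries
    split_length=3,           # 3 sentences per chunk
    split_overlap=1,          # overlap 1 sentence to preserve context
)

# In recent Haystack 2.x releases, sentence splitting needs warm_up() when the
# component runs outside a Pipeline (a Pipeline calls it automatically)
splitter.warm_up()

# Run the splitter
split_docs = splitter.run(documents=[long_document])

print(f"Original: 1 document, {len(long_text)} characters in total")
print(f"Split result: {len(split_docs['documents'])} chunks")
print("Sample chunks:")
for i, doc in enumerate(split_docs['documents'][:3]):
    print(f"Chunk {i+1}: {doc.content[:50]}...")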

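Output analysis:

The single long document is split into multiple chunks
Each chunk contains up to 3 sentences, and adjacent chunks share 1 sentence
The overlap preserves context across chunk boundaries, so information at a cut is not lost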
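
The interaction between chunk length and overlap is easier to see on a toy input. The sketch below is a plain-Python sliding window; `window_units` is a hypothetical helper written for this illustration, and Haystack's own splitter may treat the end of a document differently.

```python
from typing import List

def window_units(units: List[str], length: int, overlap: int) -> List[List[str]]:
    """Group units into windows of `length`, each sharing `overlap` units with the previous one."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    step = length - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + length])
        if start + length >= len(units):  # this window already reached the end
            break
    return windows

sentences = [f"S{i}" for i in range(1, 8)]  # 7 placeholder sentences
chunks = window_units(sentences, length=3, overlap=1)
for chunk in chunks:
    print(chunk)
```

With a length of 3 and an overlap of 1, the window advances by 2 units each time, so the last sentence of one chunk is the first sentence of the next.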

Step 3: Advanced semantic splitting


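from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, EmbeddingBasedDocumentSplitter

# Create the embedder (the keyword is `model`, not `model_name`)
embedder = SentenceTransformersDocumentEmbedder(model="all-MiniLM-L6-v2")

# Configure the semantic splitter. The parameter names below follow the Haystack 2.x
# documentation for EmbeddingBasedDocumentSplitter; check them against your installed version.
semantic_splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,   # the splitter calls the embedder internally
    sentences_per_group=2,        # sentences embedded together as one unit
    percentile=0.95,              # split where the embedding distance exceeds this percentile
    min_length=50,                # merge chunks shorter than this many characters
    max_length=1000,              # re-split chunks longer than this many characters
)

# Build the preprocessing pipeline: clean first, then split semantically
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component("cleaner", DocumentCleaner())
preprocessing_pipeline.add_component("semantic_splitter", semantic_splitter)

# Connect the components (the embedder belongs to the splitter, so it is not a separate pipeline node)
preprocessing_pipeline.connect("cleaner.documents", "semantic_splitter.documents")

# Run the pipeline
result = preprocessing_pipeline.run(
    data={
        "cleaner": {"documents": [long_document]}
    }
)

print(f"Semantic splitting result: {len(result['semantic_splitter']['documents'])} chunks")
print("Advantages of semantic splitting:")
print("- Preserves semantic integrity")
print("- Avoids cutting in the middle of a sentence")
print("- Keeps related content together")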
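
The idea behind embedding-based splitting can be sketched without loading any model: compare the vectors of neighbouring sentences and start a new chunk wherever their similarity drops. The code below is a minimal illustration under stated assumptions: the helper names are hypothetical, the 2-D vectors are hand-written stand-ins for real embeddings, and a fixed threshold is used for simplicity, whereas a real splitter may derive its cut points differently.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_on_similarity_drop(
    sentences: List[str], vectors: List[Sequence[float]], threshold: float
) -> List[List[str]]:
    """Start a new chunk wherever adjacent-sentence similarity falls below `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # topic shift detected: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks

# Toy 2-D "embeddings": the first two sentences point one way, the last two another
sentences = ["ML intro", "ML methods", "Quantum intro", "Quantum hardware"]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
semantic_chunks = split_on_similarity_drop(sentences, vectors, threshold=0.7)
print(semantic_chunks)
```

Because the cut follows the content rather than a fixed count, chunk sizes vary; minimum and maximum length limits are the usual way to keep them within a usable range.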

Step 4: Handling complex documents


from typing import Dict, List

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

class CustomDocumentProcessor:
    """Custom document processor for complex business scenarios."""

    def __init__(self, config: Dict):
        self.config = config
        self.cleaner = DocumentCleaner(**config.get('cleaner', {}))
        self.splitter = DocumentSplitter(**config.get('splitter', {}))
        # Sentence splitting may need warm_up() when the component runs outside a Pipeline
        if hasattr(self.splitter, 'warm_up'):
            self.splitter.warm_up()

    def process_documents(self, documents: List[Document]) -> List[Document]:
        """Run the full preprocessing flow."""
        # Stage 1: clean
        cleaned_docs = self.cleaner.run(documents=documents)["documents"]

        # Stage 2: filter and validate
        filtered_docs = self._filter_documents(cleaned_docs)

        # Stage 3: split
        split_docs = self.splitter.run(documents=filtered_docs)["documents"]

        # Stage 4: quality check
        quality_docs = self._quality_check(split_docs)

        return quality_docs

    def _filter_documents(self, documents: List[Document]) -> List[Document]:
        """Drop documents whose length (in characters) is outside the configured range."""
        filtered = []
        min_length = self.config.get('min_length', 50)
        max_length = self.config.get('max_length', 5000)

        for doc in documents:
            if min_length <= len(doc.content) <= max_length:
                filtered.append(doc)

        return filtered

    def _quality_check(self, documents: List[Document]) -> List[Document]:
        """Score each chunk and store the score in its metadata."""
        for doc in documents:
            # Simple heuristic text-quality estimate
            quality_score = self._calculate_quality_score(doc.content)
            doc.meta = doc.meta or {}
            doc.meta['quality_score'] = quality_score

        return documents

    def _calculate_quality_score(self, text: str) -> float:
        """Compute a heuristic quality score between 0 and 1."""
        score = 0.0
        words = text.lower().split()

        # Length score (ideal length: 200-1000 characters)
        if 200 <= len(text) <= 1000:
            score += 0.3
        elif len(text) >= 100:
            score += 0.1

        # Vocabulary score: more than 10 whitespace-separated words
        if len(words) > 10:
            score += 0.2

        # Punctuation score: punctuation marks per word should stay below 0.5
        special_chars = sum(1 for c in text if c in '.,!?;:"')
        if words and special_chars / len(words) < 0.5:
            score += 0.2

        # Repetition score: share of distinct words (guarded against empty text)
        if words and len(set(words)) / len(words) > 0.7:
            score += 0.3

        return min(score, 1.0)

# Use the custom processor
config = {
    'cleaner': {
        'remove_empty_lines': True,
        'remove_extra_whitespaces': True,
        'remove_substrings': ['Advertisement', 'Promotion']
    },
    'splitter': {
        'split_by': 'sentence',
        'split_length': 4,
        'split_overlap': 1
    },
    'min_length': 100,
    'max_length': 5000  # applies to whole cleaned documents, so it must exceed the sample's length
}

processor = CustomDocumentProcessor(config)
processed_docs = processor.process_documents([long_document])

print(f"Custom processing result: {len(processed_docs)} scored chunks")
print("Quality score samples:")
for i, doc in enumerate(processed_docs[:2]):
    print(f"Chunk {i+1}: quality score={doc.meta.get('quality_score', 0):.2f}")
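
Once each chunk carries a `quality_score` in its metadata, a natural follow-up is to keep only the chunks above some cutoff. The sketch below shows that filtering step in isolation. `Chunk` is a stand-in for `haystack.Document` so the example runs without Haystack installed, and both the helper name `keep_high_quality` and the 0.5 cutoff are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Chunk:
    """Stand-in for haystack.Document: just content plus a metadata dict."""
    content: str
    meta: Dict[str, float] = field(default_factory=dict)

def keep_high_quality(chunks: List[Chunk], min_score: float = 0.5) -> List[Chunk]:
    """Keep chunks whose meta['quality_score'] is at least `min_score`."""
    # A chunk with no recorded score is treated as 0.0 and therefore dropped
    return [c for c in chunks if c.meta.get("quality_score", 0.0) >= min_score]

chunks = [
    Chunk("A well-formed paragraph.", {"quality_score": 0.8}),
    Chunk("???", {"quality_score": 0.2}),
    Chunk("No score recorded."),
]
kept = keep_high_quality(chunks, min_score=0.5)
print([c.content for c in kept])
```

The cutoff is worth tuning against a small hand-labelled sample, since the score is a heuristic rather than a measured quality metric.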

Complete example


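from haystack import Pipeline, Document
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.components.converters import PyPDFToDocument
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Build the complete document preprocessing pipeline
def create_preprocessing_pipeline(with_pdf_converter: bool = False):
    """Create an end-to-end document preprocessing pipeline."""

    # Document store
    doc_store = InMemoryDocumentStore()

    # Pipeline setup
    pipeline = Pipeline()

    # 1. PDF to Document (only needed when the input is real PDF files)
    if with_pdf_converter:
        pipeline.add_component("converter", PyPDFToDocument())

    # 2. Text cleaning
    pipeline.add_component("cleaner", DocumentCleaner(
        remove_empty_lines=True,
        remove_extra_whitespaces=True,
        remove_substrings=["Page", "Copyright", "CONFIDENTIAL"],
        remove_regex=r'\b\d+\b'  # remove standalone numbers (this also strips numbers in body text)
    ))

    # 3. Document splitting
    pipeline.add_component("splitter", DocumentSplitter(
        split_by="sentence",
        split_length=3,
        split_overlap=1
    ))

    # 4. Embedding (the keyword is `model`, not `model_name`)
    pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
        model="all-MiniLM-L6-v2"
    ))

    # 5. Write to the store
    pipeline.add_component("writer", DocumentWriter(document_store=doc_store))

    # Connect the components; every component must be wired in, including the embedder
    if with_pdf_converter:
        pipeline.connect("converter.documents", "cleaner.documents")
    pipeline.connect("cleaner.documents", "splitter.documents")
    pipeline.connect("splitter.documents", "embedder.documents")
    pipeline.connect("embedder.documents", "writer.documents")

    return pipeline, doc_store

# Use the pipeline. The sample below is plain text, so the PDF converter is skipped.
# For real PDFs, build with with_pdf_converter=True and run with
# data={"converter": {"sources": ["report.pdf"]}}, where "report.pdf" is a placeholder path.
pipeline, doc_store = create_preprocessing_pipeline()

# Simulated PDF text (use real PDF files in production)
pdf_text = """
    Content of the first page

    Machine learning is one of the core technologies of artificial intelligence. It enables computers to learn from data and improve their performance.

    Supervised learning trains models on labeled data, while unsupervised learning discovers patterns in unlabeled data.

    Content of the second page

    Deep learning uses neural networks to tackle complex problems such as image recognition and natural language processing.

    Large language models are trained on massive amounts of text and can understand and generate human language.

    Copyright 2023 Some Company
    Page 2 of 2
"""

# Create the test document
test_docs = [Document(content=pdf_text)]

# Run the pipeline, feeding Document objects straight into the cleaner
result = pipeline.run(data={"cleaner": {"documents": test_docs}})

# InMemoryDocumentStore returns all documents when filter_documents() gets no filters
stored_docs = doc_store.filter_documents()
print(f"Pipeline finished; stored {len(stored_docs)} documents")
print("Document statistics:")
for i, doc in enumerate(stored_docs[:3]):
    print(f"Document {i+1}: characters={len(doc.content)}, meta={doc.meta}")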

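FAQ

Q1: How do I choose a splitting strategy?

A: Choose based on document type and business needs:

sentence splitting: suits academic and technical documents; keeps sentences intact
passage splitting: suits paragraph-structured content; keeps context coherent
word splitting: suits cases that need precise control over chunk length
semantic splitting: suits cases where related content must stay together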

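Q2: How do I handle special document formats?

A: Use Haystack's dedicated components:

MarkdownHeaderSplitter: splits documents at Markdown headings
CSVDocumentCleaner/CSVDocumentSplitter: handle CSV tables
ChineseDocumentSplitter: handles Chinese documents (uses HanLP)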
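
When no dedicated component fits a format, or your installed version does not ship one, a small custom function can do the splitting. The sketch below splits Markdown text at ATX headings (`#` to `######`) with a regular expression. `split_at_markdown_headings` is a hypothetical helper, not a Haystack API, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from typing import List

# A heading is 1-6 '#' characters at the start of a line, followed by a space
HEADING = re.compile(r"^#{1,6} ", flags=re.MULTILINE)

def split_at_markdown_headings(text: str) -> List[str]:
    """Return sections that each start at a Markdown heading.

    Text before the first heading, if any, becomes its own section.
    """
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]  # drop empty sections

markdown = "# Title\nIntro text.\n\n## Setup\nInstall things.\n\n## Usage\nRun it.\n"
sections = split_at_markdown_headings(markdown)
for section in sections:
    print(repr(section))
```

If your Haystack version's DocumentSplitter supports `split_by="function"`, a callable of this shape could be passed as its splitting function; treat that as an assumption to verify against the API reference.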

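Q3: How do I optimize preprocessing performance?

A: Performance tips:

Process documents in batches to reduce per-document overhead
Use a GPU to accelerate embedding computation
Run independent cleaning and splitting tasks in parallel
Cache frequently used embedding models to avoid reloading them
Choose sensible splitting parameters to avoid overly fine-grained chunks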
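
Batching is the simplest of these tips to put into practice. The following sketch groups any iterable of documents into fixed-size batches; the helper name `batched` and the batch size of 4 are illustrative assumptions, and integers stand in for documents.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch

doc_ids = list(range(10))  # stand-ins for documents
batches = list(batched(doc_ids, batch_size=4))
print(batches)
```

Each batch would then go through a single cleaner, splitter, or pipeline call, which bounds memory use and keeps one oversized input set from stalling the whole run.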

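Q4: How do I evaluate preprocessing quality?

A: Build a quality evaluation framework:

Length distribution: check that chunk lengths match expectations
Context continuity: verify logical coherence between adjacent chunks
Information completeness: make sure important information is not split incorrectly
Semantic preservation: use embedding similarity to verify semantic consistency
Processing efficiency: monitor preprocessing speed and resource consumption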
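
Of these checks, the length distribution is the cheapest to automate. The sketch below summarises chunk lengths and reports the share that falls outside an expected range. The function name `length_report` and the 50 to 800 character bounds are assumptions for illustration, and plain strings stand in for chunk contents.

```python
import statistics
from typing import Dict, List

def length_report(chunks: List[str], min_chars: int, max_chars: int) -> Dict[str, float]:
    """Summarise chunk lengths and the share outside [min_chars, max_chars]."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0}
    out_of_range = sum(1 for n in lengths if not min_chars <= n <= max_chars)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "out_of_range_ratio": out_of_range / len(lengths),
    }

chunks = ["a" * 120, "b" * 300, "c" * 15, "d" * 900]
report = length_report(chunks, min_chars=50, max_chars=800)
print(report)
```

A rising out-of-range ratio after a parameter change is an early hint that the split length or overlap no longer suits the corpus.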

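Best practices and pitfalls

Practice 1: Design principles for preprocessing pipelines

Simple to complex: start with basic cleaning, then move on to finer-grained splitting
Parameter tuning: adjust split_length and split_overlap to the document type
Error handling: add exception handling so that one failing document does not break the whole run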
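
The error-handling principle can be sketched as a wrapper that processes documents one by one, records failures, and carries on. The names `process_safely` and `strict_clean` are hypothetical, and plain strings stand in for documents.

```python
from typing import Callable, List, Tuple

def process_safely(
    texts: List[str], process: Callable[[str], str]
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Apply `process` to each text; collect failures instead of aborting the run."""
    results: List[str] = []
    failures: List[Tuple[int, str]] = []
    for index, text in enumerate(texts):
        try:
            results.append(process(text))
        except Exception as exc:  # record the failing index and continue
            failures.append((index, repr(exc)))
    return results, failures

def strict_clean(text: str) -> str:
    """Hypothetical processing step that rejects empty input."""
    if not text.strip():
        raise ValueError("empty document")
    return " ".join(text.split())

results, failures = process_safely(["  a  b ", "   ", "c"], strict_clean)
print(results)
print(failures)
```

Keeping the index of each failure makes it possible to retry or inspect only the problematic documents later.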

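Practice 2: Enterprise-grade preprocessing optimization

Incremental processing: preprocess only newly added documents
Version control: store versions of the preprocessing configuration so you can roll back and tune
Monitoring and alerting: monitor preprocessing quality and trigger alerts on anomalies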
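
Incremental processing and configuration versioning work well together: hash the configuration to get a version tag, then hash each document together with that tag to decide whether it still needs processing. The sketch below is one possible approach using only the standard library; all function names are hypothetical, and the in-memory `seen` set stands in for whatever persistent store a real system would use.

```python
import hashlib
import json
from typing import Dict, List, Set

def config_version(config: Dict) -> str:
    """Short, stable fingerprint of a preprocessing config (independent of key order)."""
    payload = json.dumps(config, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def content_key(text: str, version: str) -> str:
    """Key that changes when either the text or the config version changes."""
    return hashlib.sha256(f"{version}:{text}".encode("utf-8")).hexdigest()

def select_new(texts: List[str], seen: Set[str], version: str) -> List[str]:
    """Return texts not yet processed under this config version and mark them as seen."""
    fresh = []
    for text in texts:
        key = content_key(text, version)
        if key not in seen:
            seen.add(key)
            fresh.append(text)
    return fresh

config = {"splitter": {"split_by": "sentence", "split_length": 4, "split_overlap": 1}}
version = config_version(config)
seen: Set[str] = set()
first = select_new(["doc A", "doc B"], seen, version)
second = select_new(["doc B", "doc C"], seen, version)
print(version, first, second)
```

Because the version is part of each key, changing any configuration value automatically marks every document as unprocessed again, which is usually the desired behaviour after a parameter change.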

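Pitfall 1: Context loss from over-splitting

Problem: chunks are too fine-grained and lose the document's overall meaning
Solution: set a sensible split_overlap to keep context coherent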

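Pitfall 2: Ignoring the needs of special document formats

Problem: one-size-fits-all processing degrades quality for specially formatted documents
Solution: use a dedicated processing strategy for each document type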

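Pitfall 3: Excessive resource consumption during preprocessing

Problem: processing large batches of documents consumes too many compute resources
Solution: implement batching and resource throttling to balance throughput and quality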

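Section summary

In this section you learned the core techniques of document preprocessing in Haystack, including:

Text cleaning strategies: removing whitespace, repeated content, and headers/footers
Intelligent splitting techniques: splitting by sentence, passage, and semantics
Quality control methods: filtering, validation, and scoring
Complete pipeline construction: an end-to-end document processing pipeline

Document preprocessing is the foundation of RAG quality: only high-quality preprocessing can ensure accurate retrieval and generation downstream. In real projects, adjust the preprocessing parameters to your business needs and set up quality monitoring.

The next section looks at document storage options and how to choose and configure a Document Store suited to enterprise applications.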

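Keywords: document preprocessing, text cleaning, document splitting, Haystack preprocessing, quality control, enterprise processing
Difficulty: Advanced
Estimated reading time: 45 minutes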