5. Elasticsearch Core Operations: Search and Query

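Elasticsearch Core Operations: Search and Query in Depth (Elasticsearch 5.x)

1. Introduction

Why Elasticsearch 5.x?

Although Elasticsearch has moved on to later major versions, 5.x is still running in many existing deployments. A solid grasp of how search and queries work in 5.x also carries over to later versions and, to a degree, to other search engines.

Structure of this article

This article covers search and querying in Elasticsearch 5.x in the following parts:

Core concepts review: a brief recap of indices, documents, fields, and related concepts that the rest of the article relies on.
Search API basics: an introduction to the _search API, the entry point for searching and querying.
Query DSL in depth: a closer look at the Query DSL (Domain Specific Language), the main tool for building complex queries, including:
- Term-level queries: exact matching, range queries, and so on.
- Full-text queries: fuzzy matching, phrase matching, highlighting, and so on.
- Compound queries: bool queries, boosting, and so on.
Code practice: common query examples: worked examples of frequently used queries, each with an explanation.
Performance tuning and best practices: how to make searches and queries faster, plus a summary of good practices.
Summary and outlook: a recap of the article and a look ahead at Elasticsearch's search and query features.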

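2. Core Concepts Review

Before turning to search and querying, here is a brief recap of the core Elasticsearch concepts that the rest of the article builds on:

Index: a collection of documents, loosely comparable to a database in a relational system. In 5.x, one index can hold documents of several types.
Type: in Elasticsearch 5.x an index may contain several types, which organize documents logically. (Note: indices created in 6.x may contain only one type; 7.x deprecates types and uses _doc as the default type name in its APIs; 8.x removes them.)
Document: the basic unit that can be indexed, loosely comparable to a row in a relational table. A document is expressed as JSON and contains one or more fields.
Field: an attribute of a document, loosely comparable to a column. Every field has a data type, such as text, keyword, numeric (integer, float), or date.
Mapping: defines each field's type, how it is indexed, which analyzer it uses, and so on. The mapping determines how data is indexed and searched.
Analyzer: the component that breaks the content of a text field into terms. The choice of analyzer directly affects which documents a search finds.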

Figure 1: Relationships among the core Elasticsearch concepts (the original Mermaid graph TD diagram is not reproduced here).

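3. Search API Basics

Elasticsearch exposes a RESTful API for its operations; the entry point for searching and querying is the _search API.

Basic request structure

The _search API is normally called with HTTP GET or POST.

GET request: the query is passed in the URL, typically through the q parameter. Suitable for simple queries. (Elasticsearch also accepts a JSON request body on GET.)
POST request: the query is placed in a JSON request body. Suitable for complex queries and considerably more flexible.

Request URL format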


/[index]/[_type]/_search
/[index]/_search
/_search

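/[index]/[_type]/_search: search a specific type within a specific index (5.x).
/[index]/_search: search a specific index. In 5.x this searches every type in the index; it is also the form used by 6.x and later.
/_search: search across all indices.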
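
The three URL forms differ only in which path segments are present. As a minimal sketch (the helper name search_path is hypothetical, not part of any client library), the path could be assembled like this:

```python
def search_path(index=None, doc_type=None):
    """Build a _search path from an optional index and an optional 5.x type."""
    if doc_type and not index:
        # Simplification for this sketch: require an explicit index
        # (in 5.x, "_all" can be used as the index name).
        raise ValueError("give an index (for example '_all') when a type is given")
    parts = [p for p in (index, doc_type) if p]
    return "/" + "/".join(parts + ["_search"])


print(search_path("my_index", "my_type"))  # /my_index/my_type/_search
print(search_path("my_index"))             # /my_index/_search
print(search_path())                       # /_search
```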

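Request body

The request body is JSON and carries the search parameters. The most important one is query, which defines the search criteria.

Response body

The response body is also JSON and describes the search result. Its main parts are:

took: time spent executing the search, in milliseconds.
timed_out: whether the search timed out.
_shards: shard statistics: total, successful, and failed.
hits: the search results, containing:
- total: total number of matching documents.
- max_score: the highest score among the matching documents.
- hits: the array of returned documents, each containing:
  - _index, _type, _id, _score: the document's metadata.
  - _source: the original JSON of the document (returned by default).
  - fields: present only when the request asks for specific fields, for example through stored_fields, docvalue_fields, or script_fields (in 5.x the older fields request parameter was replaced by stored_fields).
  - highlight: the highlighted fragments, if highlighting was requested.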
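
To make the response layout concrete, here is a small sketch that walks a hand-written 5.x-style response with Python's json module. The response text and the summarize helper are illustrative assumptions, not output captured from a cluster.

```python
import json

# Hand-written sample in the 5.x response shape (hits.total is a plain integer).
response_text = """
{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {"_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1.0,
       "_source": {"title": "Document Title 1"}},
      {"_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1.0,
       "_source": {"title": "Document Title 2"}}
    ]
  }
}
"""


def summarize(response):
    """Return the total hit count and a list of (_id, _score, title) tuples."""
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"].get("title")) for h in hits["hits"]]
    return hits["total"], rows


response = json.loads(response_text)
total, rows = summarize(response)
print(total)
for row in rows:
    print(row)
```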

A simple match_all example

The following match_all query retrieves all documents in the my_index index (using a POST request):

Request:


POST /my_index/_search
{
  "query": {
    "match_all": {}
  }
}

Response (simplified):


{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "Document Title 1",
          "content": "This is the content of document 1."
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "title": "Document Title 2",
          "content": "This is the content of document 2."
        }
      },
      // ... more documents
    ]
  }
}

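This example shows the basic use of the _search API and the structure of its request and response. Note that only the first 10 hits are returned by default, even though total reports all matches. Next, we look at the Query DSL for building more complex queries.

4. Query DSL in Depth

The Elasticsearch Query DSL (Domain Specific Language) is a JSON-based query language. It offers a wide range of query types that can be combined to express complex search requirements.

Query DSL clauses fall into two groups:

Leaf query clauses: look for a particular value in a particular field. Examples are the match, term, and range queries.
Compound query clauses: combine leaf clauses or other compound clauses to express more complex logic. Examples are the bool and boosting queries.

4.1 Leaf Query Clauses

Leaf query clauses are the building blocks of the Query DSL. We divide them further into:

Full-text queries: for full-text search on text fields.
Term-level queries: for exact matching on structured data fields.

4.1.1 Full-text queries

Full-text queries are mainly used on text fields. They run the query string through the field's analyzer and then match the resulting terms against the terms indexed for each document.

match query: the most basic full-text query.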
```
{
  "match": {
    "<field>": "<query_value>"
  }
}
```
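The match query analyzes query_value with the field's analyzer. By default a document matches as soon as the field contains any one of the resulting terms (operator "or"); setting operator to "and" requires all of them.

Example: find documents whose title field contains "elasticsearch" or "tutorial".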
```
POST /my_index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch tutorial"
    }
  }
}
```
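
The matching rule can be illustrated with a toy model in Python. This is only a sketch of the idea: analyze below is a crude stand-in for the standard analyzer (lowercased word tokens), and toy_match is a hypothetical helper that ignores scoring entirely.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def toy_match(field_text, query_text, operator="or"):
    """Return True if the analyzed query terms match the analyzed field text."""
    doc_terms = set(analyze(field_text))
    found = [term in doc_terms for term in analyze(query_text)]
    return all(found) if operator == "and" else any(found)


print(toy_match("Elasticsearch in Action", "Elasticsearch tutorial"))         # True
print(toy_match("Elasticsearch in Action", "Elasticsearch tutorial", "and"))  # False
```

With the default "or" behaviour a single shared term is enough, which is why the query above also returns titles that never mention "tutorial".
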
match_phrase query: phrase matching.
```
{
  "match_phrase": {
    "<field>": "<query_value>"
  }
}
```
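The match_phrase query requires every term of the query phrase to appear in the field, in the same order. By default the allowed gap between terms (slop) is 0, so the terms must be adjacent.

Example: find documents whose title field contains the phrase "quick brown fox".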
```
POST /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "quick brown fox"
    }
  }
}
```
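The slop parameter relaxes this requirement: it is the number of position moves the terms may make to line up with the document. Reordered terms need a larger slop than terms that merely have a word between them.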
```
{
  "match_phrase": {
    "title": {
      "query": "brown fox quick",
      "slop": 2
    }
  }
}
```
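
How much slop a phrase needs can be estimated with a simplified model: for each phrase term, subtract its position in the phrase from its position in the document, then take the spread of those offsets. The sketch below assumes whitespace tokenization and that every phrase term occurs exactly once in the document; required_slop is a hypothetical helper, not an Elasticsearch API.

```python
def required_slop(doc_text, phrase):
    """Smallest slop for which the phrase matches, under the simplified model."""
    doc_terms = doc_text.lower().split()
    phrase_terms = phrase.lower().split()
    # Offset of each term: document position minus position inside the phrase.
    # list.index raises ValueError if a phrase term is missing from the document.
    offsets = [doc_terms.index(term) - i for i, term in enumerate(phrase_terms)]
    return max(offsets) - min(offsets)


print(required_slop("quick brown fox", "quick brown fox"))  # 0
print(required_slop("quick brown fox", "quick fox"))        # 1
print(required_slop("quick brown fox", "fox quick"))        # 3
```

Under this model the reordered phrase "brown fox quick" needs a slop of 3 to match a title reading "quick brown fox".
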
multi_match query: runs a match query against several fields.
```
{
  "multi_match": {
    "query": "<query_value>",
    "fields": ["<field1>", "<field2>", ...]
  }
}
```
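The multi_match query runs the same match query on each of the listed fields. The type parameter controls how the per-field results are combined, for example best_fields (the default, which scores a document by its best-matching field), most_fields (which combines the scores of all matching fields), and cross_fields (which treats the fields as one large field).

Example: search for "Elasticsearch" in the title and content fields.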
```
POST /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Elasticsearch",
      "fields": ["title", "content"]
    }
  }
}
```
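
When such a query is built in application code, the type and per-field boosts (the field^boost notation) are usually filled in programmatically. A minimal sketch, assuming a hypothetical helper named multi_match that only builds the JSON body:

```python
import json


def multi_match(query, fields, match_type="best_fields", boosts=None):
    """Build a multi_match clause; boosts maps a field name to its boost factor."""
    boosts = boosts or {}
    boosted = [f"{f}^{boosts[f]}" if f in boosts else f for f in fields]
    return {"multi_match": {"query": query, "type": match_type, "fields": boosted}}


body = {"query": multi_match("Elasticsearch", ["title", "content"], boosts={"title": 2})}
print(json.dumps(body, indent=2))
```

Here a match in title counts twice as much as a match in content; the boost value 2 is an arbitrary illustration.
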
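common_terms query: a more specialized full-text query for handling very frequent words (such as stopwords).

The common_terms query can improve both precision and performance on text that contains many high-frequency words. It splits the query terms into a low-frequency and a high-frequency group and treats the two groups differently.

query_string and simple_query_string queries: build complex queries from a compact query syntax written as a single string.

query_string supports the full Lucene query syntax. It is powerful but strict: a syntax error makes the request fail. simple_query_string uses a simpler, more forgiving syntax that ignores invalid parts instead of raising an error, which makes it a better fit for text typed in directly by users.

Example: use query_string to find documents whose title field contains "Elasticsearch" or "Kibana".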
```
POST /my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "Elasticsearch OR Kibana"
    }
  }
}
```
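
Because query_string is strict about syntax, raw user input is often escaped before it is placed in the query. The sketch below is one possible approach, not an official utility: it backslash-escapes the reserved characters and removes < and >, which cannot be escaped. The helper name escape_query_string is hypothetical.

```python
import re

# Reserved characters of the query_string syntax, escaped one by one.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def escape_query_string(user_input):
    """Escape reserved characters; drop < and >, which cannot be escaped."""
    without_angles = user_input.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", without_angles)


print(escape_query_string("C++ (beta)"))  # C\+\+ \(beta\)
print(escape_query_string("price:<10"))   # price\:10
```
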

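4.1.2 Term-level queries

Term-level queries are mainly used for exact matching on structured fields (keyword, integer, date, and so on). They do not analyze the query value; it is compared directly with the terms stored in the index.

term query: exact match.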
```
{
  "term": {
    "<field>": "<query_value>"
  }
}
```
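The term query requires a term in the field to be identical to query_value. On a text field the indexed terms have already been analyzed (typically lowercased and split into words), so term queries are best kept to keyword, numeric, and date fields.

Example: find documents whose status field is "published".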
```
POST /my_index/_search
{
  "query": {
    "term": {
      "status": "published"
    }
  }
}
```
terms query: exact match against several values.
```
{
  "terms": {
    "<field>": ["<value1>", "<value2>", ...]
  }
}
```
The terms query accepts a list of values; a document matches if a term in the field equals any one of them.

Example: find documents whose tags field contains "elasticsearch" or "kibana".
```
POST /my_index/_search
{
  "query": {
    "terms": {
      "tags": ["elasticsearch", "kibana"]
    }
  }
}
```
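
Exact matching is easiest to picture as a lookup in an inverted index that maps each term to the documents containing it. The toy model below is a sketch with made-up documents and hypothetical helper names; it shows why a term lookup is case-sensitive and why terms is simply the union of several lookups.

```python
from collections import defaultdict


def build_index(docs, field):
    """Map every value of the field to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        values = doc[field]
        for value in values if isinstance(values, list) else [values]:
            index[value].add(doc_id)
    return index


def term(index, value):
    """Exact lookup of a single term."""
    return set(index.get(value, set()))


def terms(index, values):
    """Union of the exact lookups for several terms."""
    result = set()
    for value in values:
        result |= term(index, value)
    return result


docs = {
    "1": {"tags": ["elasticsearch", "kibana"]},
    "2": {"tags": ["logstash"]},
    "3": {"tags": ["Kibana"]},  # different case, therefore a different term
}
index = build_index(docs, "tags")
print(term(index, "kibana"))                              # {'1'}
print(sorted(terms(index, ["elasticsearch", "logstash"])))  # ['1', '2']
```
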

range query: matches values within a range.


{
  "range": {
    "<field>": {
      "gte": <lower_bound>,  // greater than or equal to
      "lte": <upper_bound>,  // less than or equal to
      "gt": <lower_bound>,   // greater than
      "lt": <upper_bound>    // less than
    }
  }
}

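The range query matches documents whose field value lies within the given bounds, set with gte, lte, gt, and lt. A real query normally uses at most one lower bound (gte or gt) and one upper bound (lte or lt).

Example: find documents whose publish_date is between 2023-01-01 and 2023-01-31, inclusive.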


POST /my_index/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2023-01-01",
        "lte": "2023-01-31"
      }
    }
  }
}
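
Date bounds like these are often computed rather than typed by hand. As a small sketch (month_range is a hypothetical helper that only builds the JSON body), the inclusive bounds for a calendar month can be derived with the standard library:

```python
import calendar
from datetime import date


def month_range(field, year, month):
    """Build a range clause covering one calendar month, both ends inclusive."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    return {
        "range": {
            field: {
                "gte": date(year, month, 1).isoformat(),
                "lte": date(year, month, last_day).isoformat(),
            }
        }
    }


print(month_range("publish_date", 2023, 1))
```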

exists query: tests whether a field has a value.


{
  "exists": {
    "field": "<field>"
  }
}

The exists query finds documents in which the given field has at least one non-null value.

Example: find documents that have a tags field.


POST /my_index/_search
{
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}
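
The opposite case, documents that lack a field, has no dedicated query in 5.x (the older missing query was removed in 5.0); the usual approach is to wrap exists in a bool must_not clause. A minimal sketch, with missing_field as a hypothetical helper name:

```python
import json


def missing_field(field):
    """Build a query matching documents in which the field has no value."""
    return {"bool": {"must_not": [{"exists": {"field": field}}]}}


body = {"query": missing_field("tags")}
print(json.dumps(body))
```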

prefix query: matches terms that start with a given prefix.
```
{
  "prefix": {
    "<field>": "<prefix_value>"
  }
}
```
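The prefix query finds documents that have a term in the field starting with the given prefix. Like other term-level queries it does not analyze the prefix, so on a text field indexed with the standard analyzer (which lowercases terms) the prefix should be written in lowercase.

Example: find documents whose title field has a term starting with "elas" (assuming title is a text field that uses the standard analyzer).
```
POST /my_index/_search
{
  "query": {
    "prefix": {
      "title": "elas"
    }
  }
}
```
Note: prefix queries are relatively expensive, especially with short prefixes, because many terms in the field's term dictionary have to be visited. Use them with care and prefer more selective conditions where possible.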
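
The cost of a prefix query comes from walking every term that shares the prefix. The toy model below sketches this over a sorted list of terms; the term list and the helper name terms_with_prefix are made up for illustration and say nothing about Lucene's actual data structures.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return all terms starting with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)  # first candidate position
    matches = []
    for candidate in sorted_terms[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(candidate)
    return matches


term_dictionary = sorted(["elastic", "elasticsearch", "elk", "kibana", "logstash", "search"])
print(terms_with_prefix(term_dictionary, "elas"))  # ['elastic', 'elasticsearch']
print(terms_with_prefix(term_dictionary, "e"))     # ['elastic', 'elasticsearch', 'elk']
print(terms_with_prefix(term_dictionary, "Elas"))  # []
```

The shorter the prefix, the more terms are visited, and a prefix in the wrong case finds nothing in a lowercased term list.
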
wildcard query: matches terms against a wildcard pattern.
```
{
  "wildcard": {
    "<field>": "<wildcard_pattern>"
  }
}
```
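The wildcard query matches terms against a pattern with two wildcards: * (any sequence of characters, including none) and ? (any single character). The pattern is not analyzed, so on a lowercased text field it should be written in lowercase.

Example: find documents whose title field has a term matching "elasti*" (assuming title is a text field that uses the standard analyzer).
```
POST /my_index/_search
{
  "query": {
    "wildcard": {
      "title": "elasti*"
    }
  }
}
```
Note: wildcard queries are slow and best avoided where possible. A pattern that starts with a wildcard is the worst case, because every term of the field has to be examined.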
regexp query: matches terms against a regular expression.
```
{
  "regexp": {
    "<field>": "<regex_pattern>"
  }
}
```
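The regexp query allows more complex pattern matching with regular expressions. The pattern must match the whole term, not just part of it, and it is not analyzed.

Example: find documents whose title field has a term matching the regular expression "elast[a-z]*ic" (assuming title is a text field that uses the standard analyzer).
```
POST /my_index/_search
{
  "query": {
    "regexp": {
      "title": "elast[a-z]*ic"
    }
  }
}
```
Note: regexp queries are also fairly slow. Use them with care and keep the expression as specific as possible, ideally with a literal prefix.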
fuzzy query: matches terms that are similar to the query value.
```
{
  "fuzzy": {
    "<field>": "<query_value>"
  }
}
```
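The fuzzy query finds terms that are similar to query_value, measured by edit distance (Levenshtein distance). The allowed distance is set with the fuzziness parameter; its default, AUTO, allows up to 2 edits for terms longer than 5 characters.

Example: find documents whose title field has a term similar to "elastik" (the default AUTO fuzziness allows an edit distance of up to 2 for a term of this length).
```
POST /my_index/_search
{
  "query": {
    "fuzzy": {
      "title": "elastik"
    }
  }
}
```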
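
Edit distance and the AUTO rule are easy to check by hand. The sketch below implements plain Levenshtein distance and the documented AUTO thresholds (0 edits for 1 to 2 characters, 1 edit for 3 to 5, 2 edits beyond that). It is an illustration only; Elasticsearch by default also counts a transposition of adjacent characters as a single edit, which this version does not.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Maximum edits allowed by fuzziness AUTO for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


print(levenshtein("elastik", "elastic"))  # 1
print(levenshtein("Elastik", "elastic"))  # 2
print(auto_fuzziness("elastik"))          # 2
```
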

ids query: retrieves documents by a list of document IDs.


{
  "ids": {
    "type": "<type>",  // optional; in 5.x restricts the lookup to one type
    "values": ["<id1>", "<id2>", ...]
  }
}

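The ids query retrieves documents by their IDs. In 5.x the optional type parameter restricts the lookup to one type.

Example: find the documents of type my_type with IDs "1", "2", and "3".


POST /my_index/_search
{
  "query": {
    "ids": {
      "type": "my_type",
      "values": ["1", "2", "3"]
    }
  }
}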

4.2 Compound Query Clauses

Compound query clauses combine leaf clauses or other compound clauses to express more complex logic.

bool query: the most commonly used compound query.


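{
  "bool": {
    "must": [ ... ],       // must match; contributes to the score
    "should": [ ... ],     // should match; a match raises the score
    "must_not": [ ... ],   // must not match; does not contribute to the score
    "filter": [ ... ]      // must match, but does not contribute to the score; used for filtering
  }
}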

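The bool query combines several conditions with Boolean logic: "and" (must), "or" (should), and "not" (must_not). A filter clause must match just like must, but it does not take part in scoring. That makes it cheaper and allows Elasticsearch to cache its result, so yes/no conditions belong in filter.

Example: find documents whose title field contains "Elasticsearch" and whose status field is "published".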


POST /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Elasticsearch" } }
      ],
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}

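should clauses and the minimum_should_match parameter

should clauses express an "or" relationship. If the bool query has no must or filter clause, at least one should clause has to match; otherwise should clauses are optional by default and only raise the score. The minimum_should_match parameter sets how many should clauses must match.

Example: find documents whose tags field contains at least two of "elasticsearch", "kibana", and "logstash".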


POST /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "tags": "elasticsearch" } },
        { "term": { "tags": "kibana" } },
        { "term": { "tags": "logstash" } }
      ],
      "minimum_should_match": 2
    }
  }
}
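
The effect of minimum_should_match can be reproduced with a few lines of Python. This is a toy check over in-memory tag lists, with should_match as a hypothetical helper; it mirrors only the counting rule, not scoring.

```python
def should_match(doc_tags, should_terms, minimum_should_match=1):
    """Return True if enough of the should terms occur in the document's tags."""
    matched = sum(1 for term in should_terms if term in doc_tags)
    return matched >= minimum_should_match


wanted = ["elasticsearch", "kibana", "logstash"]
print(should_match(["elasticsearch", "kibana"], wanted, 2))  # True
print(should_match(["kibana", "beats"], wanted, 2))          # False
```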

boosting query: returns the documents matching a positive query and lowers the score of those that also match a negative query.


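{
  "boosting": {
    "positive": { ... },  // query that documents must match
    "negative": { ... },  // documents that also match this have their score reduced
    "negative_boost": <value> // multiplier applied to the reduced scores, 0 < value < 1
  }
}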

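The boosting query returns every document that matches the positive query. Documents that also match the negative query are not excluded; their score is multiplied by negative_boost, which controls how strongly they are demoted.

Example: return documents whose title field contains "Elasticsearch", demoting those whose content field contains "deprecated".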


POST /my_index/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "title": "Elasticsearch" }
      },
      "negative": {
        "match": { "content": "deprecated" }
      },
      "negative_boost": 0.2
    }
  }
}
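
The score adjustment itself is a single multiplication. The following sketch models it with a hypothetical boosting_score helper and made-up scores; a positive score of None stands for a document that does not match the positive query and is therefore not returned.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Final score under a boosting query, or None if the document is not returned."""
    if positive_score is None:
        return None  # no match on the positive query: not part of the result
    if matches_negative:
        return positive_score * negative_boost  # demoted, but still returned
    return positive_score


print(boosting_score(2.0, False, 0.2))  # 2.0
print(boosting_score(2.0, True, 0.2))   # 0.4
print(boosting_score(None, True, 0.2))  # None
```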

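function_score query: finer control over scoring.

The function_score query modifies the score of the documents returned by a query using one or more functions, for example:
- script_score: computes the score with a custom script.
- weight: multiplies the score by a constant weight.
- random_score: generates a random score.
- field_value_factor: uses the value of a field as a scoring factor.
- Decay functions: gauss, linear, and exp, which lower the score as a field value moves away from a given origin.
Together these cover most custom ranking and relevance-tuning needs.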
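
As an illustration of field_value_factor, the sketch below builds a function_score body and computes the score it would be expected to produce with the log1p modifier and boost_mode multiply. The field name sales_count and both helper names are assumptions made for this example; they do not appear in the article's sample data.

```python
import math


def popularity_boost(base_query, field, factor=1.2):
    """Build a function_score query that scales the score by a numeric field."""
    return {
        "function_score": {
            "query": base_query,
            "field_value_factor": {
                "field": field,       # hypothetical numeric field
                "factor": factor,
                "modifier": "log1p",  # log10(1 + factor * value)
                "missing": 1,         # value used when the field is absent
            },
            "boost_mode": "multiply",  # final score = query score * function value
        }
    }


def expected_score(query_score, field_value, factor=1.2):
    """Score expected from the body above for one document."""
    return query_score * math.log10(1 + factor * field_value)


body = {"query": popularity_boost({"match": {"name": "elasticsearch"}}, "sales_count")}
print(body["query"]["function_score"]["field_value_factor"])
print(round(expected_score(2.0, 7.5), 3))
```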

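5. Code Practice: Common Query Examples

To show how the Query DSL is used in practice, this section walks through a set of common queries.

Setup

Assume an index named product_index that holds documents of type product with the following structure: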


{
  "id": 1,
  "name": "Elasticsearch Server",
  "description": "The official Elasticsearch server.",
  "price": 599.00,
  "tags": ["search", "distributed", "nosql"],
  "publish_date": "2023-10-26",
  "status": "published"
}

Common query examples (using curl)

match_all query (retrieve all products)


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

match query (find products whose name contains "elasticsearch")


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": "elasticsearch"
    }
  }
}
'

match_phrase query (find products whose description contains the phrase "official server")


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "description": "official server"
    }
  }
}
'

term query (find products whose status is "published")


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "status": "published"
    }
  }
}
'

terms query (find products tagged "search" or "nosql")


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "terms": {
      "tags": ["search", "nosql"]
    }
  }
}
'

range query (find products priced between 500 and 600, inclusive)


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "price": {
        "gte": 500,
        "lte": 600
      }
    }
  }
}
'

bool query (combined query: find products whose name contains "elasticsearch" and whose price is below 600)


curl -X POST "localhost:9200/product_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "elasticsearch" } }
      ],
      "filter": [
        { "range": { "price": { "lt": 600 } } }
      ]
    }
  }
}
'
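
The same request can be prepared from Python with only the standard library. The sketch below builds the request object without sending it; the host and the helper name build_search_request are assumptions for illustration, and urllib.request.urlopen(request) would send it to a running node.

```python
import json
from urllib.request import Request


def build_search_request(index, body, host="http://localhost:9200"):
    """Prepare a POST /<index>/_search request carrying a JSON query body."""
    return Request(
        url=f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query_body = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "elasticsearch"}}],
            "filter": [{"range": {"price": {"lt": 600}}}],
        }
    }
}
request = build_search_request("product_index", query_body)
print(request.get_method(), request.full_url)
```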