08_Elasticsearch Operations_Compound Queries

lin在路上 2020-05-12

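The previous articles covered single query clauses. This one introduces compound queries: bool query, boosting query, constant_score, dis_max, multi_match, and function_score. The last five complement the bool query and address specific problems that bool alone cannot solve.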

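1 Overview

bool query

Combines multiple queries into a more complex one using Boolean logic.
Supports must and filter (must match), should (optional match), must_not (must not match), and should_not, which is built by nesting.

boosting query

Adjusts the weight of part of the query conditions to influence how the returned documents are ranked.

constant_score

Returns a specified fixed score. Usually combined with filter, because filter context does not compute a score.

dis_max

Optional matching across several queries; the relevance score is that of the highest-scoring query.

multi_match

Builds on match to run the same query text against multiple fields.
Supports Best Fields, Most Fields, Cross Fields, and other types.

function_score

Applies functions to recompute the relevance score of every matching document, then sorts by the new score.
Supports weight (weight * _score), field_value_factor (uses the value of a document field to modify _score), random_score (uses a random number to produce the score), and more.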

2 Bool query

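Characteristics of the bool query

Made up of one or more sub-queries, which may appear in any order
Can nest multiple queries, including other bool queries
If the bool query has no must (or filter) clause, at least one should clause has to match for a document to be returned
Supports four clause types: must, should, must_not, and filter

must -- must match, Query Context, contributes to the score
should -- optional match, Query Context, contributes to the score
must_not -- must not match, Filter Context, does not contribute to the score
filter -- must match, Filter Context, does not contribute to the score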

Query Context VS Filter Context

Query Context

Computes a relevance score

Filter Context

No scoring needed (Yes or No); results can be cached, which gives better performance

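Outline

Basic syntax
Optimizing queries on arrays
Implementing should not
Weight control

Controlling weight by nesting bool queries
Controlling weight with the boost parameter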

Query Context VS Filter Context

2.1 Basic syntax

 POST /products/_bulk
 { "index": { "_id": 1 }}
 { "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
 { "index": { "_id": 2 }}
 { "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
 { "index": { "_id": 3 }}
 { "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
 { "index": { "_id": 4 }}
 { "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }
 
 
 
 #Basic syntax
 POST /products/_search
 {
   "query": {
     "bool" : {
       "must" : {
         "term" : { "price" : "30" }
       },
       "filter": {
         "term" : { "avaliable" : "true" }
       },
       "must_not" : {
         "range" : {
           "price" : { "lte" : 10 }
         }
       },
       "should" : [
         { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
         { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
       ],
       "minimum_should_match" :1
     }
   }
 }
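 #The four clause types can be combined freely, and each one can hold several queries
 #Without a must (or filter) clause, at least one should query has to match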
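
To make the matching rules concrete, here is a minimal Python sketch that evaluates the same bool logic against an in-memory copy of the four product documents. It only models which documents match, not how they are scored. The helper `bool_match` and the use of lambdas as clauses are illustrative assumptions, not part of Elasticsearch.

```python
# Hypothetical in-memory copy of the four documents from the _bulk request above.
PRODUCTS = [
    {"_id": 1, "price": 10, "avaliable": True, "productID": "XHDK-A-1293-#fJ3"},
    {"_id": 2, "price": 20, "avaliable": True, "productID": "KDKE-B-9947-#kL5"},
    {"_id": 3, "price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"},
    {"_id": 4, "price": 30, "avaliable": False, "productID": "QQPX-R-3956-#aD8"},
]


def bool_match(doc, must=(), filter_=(), must_not=(), should=(),
               minimum_should_match=None):
    """Return True when doc satisfies a bool query whose clauses are predicates."""
    if minimum_should_match is None:
        # Default: 1 if there are should clauses but no must/filter, otherwise 0.
        minimum_should_match = 1 if should and not (must or filter_) else 0
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filter_):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


# Same conditions as the "basic syntax" request.
hits = [
    doc["_id"] for doc in PRODUCTS
    if bool_match(
        doc,
        must=[lambda d: d["price"] == 30],
        filter_=[lambda d: d["avaliable"] is True],
        must_not=[lambda d: d["price"] <= 10],
        should=[
            lambda d: d["productID"] == "JODL-X-1937-#pV7",
            lambda d: d["productID"] == "XHDK-A-1293-#fJ3",
        ],
        minimum_should_match=1,
    )
]

# should-only query: at least one should clause has to match.
should_only = [
    doc["_id"] for doc in PRODUCTS
    if bool_match(
        doc,
        should=[lambda d: d["price"] == 20, lambda d: d["avaliable"] is False],
    )
]

print(hits, should_only)
```

Only document 3 satisfies every clause of the first query; the should-only query returns the documents that match at least one of its two clauses.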

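2.2 Optimizing queries on arrays

A query against an array field matches when the array contains the value, not when it equals it. To get an exact match, add an extra field to filter on.

 #Change the data model by adding a field (genre_count). This works around arrays matching by containment rather than exact equality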
 POST /newmovies/_bulk
 { "index": { "_id": 1 }}
 { "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy","genre_count":1 }
 { "index": { "_id": 2 }}
 { "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }
 
 #must: scored
 POST /newmovies/_search
 {
   "query": {
     "bool": {
       "must": [
         {"term": {"genre.keyword": {"value": "Comedy"}}},
         {"term": {"genre_count": {"value": 1}}}
 
       ]
     }
   }
 }
 
 #filter: not scored, every hit has a score of 0
 POST /newmovies/_search
 {
   "query": {
     "bool": {
       "filter": [
         {"term": {"genre.keyword": {"value": "Comedy"}}},
         {"term": {"genre_count": {"value": 1}}}
         ]
     }
   }
 }
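
The difference between "contains" and "equals" can be sketched in a few lines of Python. This is an illustration under the assumption that a term query on a multi-valued field behaves like a membership test; the list `MOVIES` and the helper `term_genre` are hypothetical names.

```python
# Hypothetical in-memory copy of the two movie documents; a single value is
# treated as a one-element array, as Elasticsearch does.
MOVIES = [
    {"_id": 1, "genre": ["Comedy"], "genre_count": 1},
    {"_id": 2, "genre": ["Comedy", "Romance"], "genre_count": 2},
]


def term_genre(doc, value):
    """A term query on an array field matches if the array contains the value."""
    return value in doc["genre"]


# Containment only: both movies match "Comedy".
contains = [d["_id"] for d in MOVIES if term_genre(d, "Comedy")]

# Containment plus the extra genre_count field: only the pure comedy matches.
exact = [
    d["_id"] for d in MOVIES
    if term_genre(d, "Comedy") and d["genre_count"] == 1
]

print(contains, exact)
```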

2.3 Implementing should not

 POST /products/_search
 {
   "query": {
     "bool": {
       "must": {
         "term": {
           "price": "30"
         }
       },
       "should": [
         {
           "bool": {
             "must_not": {
               "term": {
                 "avaliable": "false"
               }
             }
           }
         }
       ],
       "minimum_should_match": 1
     }
   }
 }
 #Nesting must_not inside should gives the should not logic

2.4 Weight control

Controlling weight by nesting

 POST /animals/_search
 {
   "query": {
     "bool": {
       "should": [
         { "term": { "text": "brown" }},
         { "term": { "text": "red" }},
         { "term": { "text": "quick"   }},
         { "term": { "text": "dog"   }}
       ]
     }
   }
 }
 
 POST /animals/_search
 {
   "query": {
     "bool": {
       "should": [
         { "term": { "text": "quick" }},
         { "term": { "text": "dog"   }},
         {
           "bool":{
             "should":[
                { "term": { "text": "brown" }},
                { "term": { "text": "red" }}
             ]
           }
         }
       ]
     }
   }
 }
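 #Competing clauses at the same level carry equal weight
 #Nesting a bool query changes how much each clause contributes to the score

Controlling weight with the boost parameter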

 POST /blogs/_bulk
 { "index": { "_id": 1 }}
 {"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
 { "index": { "_id": 2 }}
 {"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }
 
 POST blogs/_search
 {
   "query": {
     "bool": {
       "should": [
         {"match": {
           "title": {
             "query": "apple,ipad",
             "boost": 1.1
           }
         }},
         {"match": {
           "content": {
             "query": "apple,ipad",
             "boost":1.5
           }
         }}
       ]
     }
   }
 }
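 #boost is the parameter that controls relevance; it can be applied to indices, fields, and query clauses
 #The larger the boost value, the higher the weight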

2.5 Filter Context VS Query Context

Filter Context does not affect scoring

 #Filtering Context
 POST /products/_search
 {
   "query": {
     "bool" : {
       "filter": {
         "term" : { "avaliable" : "true" }
       },
       "must_not" : {
         "range" : {
           "price" : { "lte" : 10 }
         }
       }
     }
   }
 }

Query Context affects scoring

 POST /products/_search
 {
   "query": {
     "bool": {
       "should": [
         {
           "term": {
             "productID.keyword": {
               "value": "JODL-X-1937-#pV7"}}
         },
         {"term": {"avaliable": {"value": true}}
         }
       ]
     }
   }
 }

3 boosting query

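Combining must and must_not in a bool query removes unwanted documents, but it often drops useful documents by mistake. A boosting query lowers the weight of documents that contain the unwanted terms instead of filtering them out, so the result set stays complete while the better-matching documents move to the front.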

 POST /news/_bulk
 { "index": { "_id": 1 }}
 { "content":"Apple Mac" }
 { "index": { "_id": 2 }}
 { "content":"Apple iPad" }
 { "index": { "_id": 3 }}
 { "content":"Apple employee like Apple Pie and Apple Juice" }
 
 
 POST news/_search
 {
   "query": {
     "bool": {
       "must": {
         "match":{"content":"apple"}
       }
     }
   }
 }
 
 POST news/_search
 {
   "query": {
     "bool": {
       "must": {
         "match":{"content":"apple"}
       },
       "must_not": {
         "match":{"content":"pie"}
       }
     }
   }
 }
 #Documents that mention pie are filtered out entirely
 
 POST news/_search
 {
   "query": {
     "boosting": {
       "positive": {
         "match": {
           "content": "apple"
         }
       },
       "negative": {
         "match": {
           "content": "pie"
         }
       },
       "negative_boost": 0.5
     }
   }
 }
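 #The query takes three keys together: positive, negative, and negative_boost. Only documents matching the positive query are included in the results; those that also match the negative query have their relevance lowered (their score is multiplied by negative_boost)
 #Use it to push certain documents down the ranking without excluding them from the results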
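
The effect of negative_boost on the ranking can be sketched as follows. The scores in `positive_scores` are made-up numbers chosen for illustration, not real BM25 output, and `boosting_score` is a hypothetical helper that only models the multiplication step.

```python
import math


def boosting_score(positive_score, matches_negative, negative_boost):
    """Score of a document that already matched the positive query."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Contents of the three news documents indexed above.
contents = {
    1: "Apple Mac",
    2: "Apple iPad",
    3: "Apple employee like Apple Pie and Apple Juice",
}

# Assumed scores for the positive query match(content: apple); illustrative only.
positive_scores = {1: 0.16, 2: 0.16, 3: 0.20}

final = {
    doc_id: boosting_score(
        score,
        matches_negative="pie" in contents[doc_id].lower().split(),
        negative_boost=0.5,
    )
    for doc_id, score in positive_scores.items()
}

# Sort by descending score, breaking ties by document id.
ranking = sorted(final, key=lambda doc_id: (-final[doc_id], doc_id))
print(final, ranking)
```

Document 3 stays in the result set but falls to the bottom, which is the difference from the must_not version.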

4 constant_score

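The filter clause of a bool query does not produce a score. When a scenario still needs the returned documents to carry a score, constant_score can assign each of them a fixed one.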

 POST /products/_search
 {
   "query": {
     "constant_score": {
       "filter": {
         "match" : { "avaliable" : true }
       },
     "boost":2.5
     }
   }
 }
 #Every matching document gets a score of 2.5

5 dis_max

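In a bool query, the scores of the should clauses are combined, so ranking depends mostly on the sum over several query conditions. Sometimes, though, the business only cares about the single most relevant sub-query, and should does not serve that well. That is the case for dis_max, which returns the score of the highest-scoring sub-query.

should

Runs the two queries in the should clause
Adds up the scores of the two queries
Multiplies by the number of matching clauses
Divides by the total number of clauses
(The last two steps describe the coordination factor of the classic TF/IDF similarity. Newer Elasticsearch versions no longer apply it, so there the score is simply the sum of the matching clauses.)

dis_max

Runs the queries listed under queries
Returns the score of the best-matching query as the final score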

 PUT /blogs/_doc/1
 {
     "title": "Quick brown rabbits",
     "body":  "Brown rabbits are commonly seen."
 }
 
 PUT /blogs/_doc/2
 {
     "title": "Keeping pets healthy",
     "body":  "My quick brown fox eats rabbits on a regular basis."
 }
 
 POST /blogs/_search
 {
     "query": {
         "bool": {
             "should": [
                 { "match": { "title": "Brown fox" }},
                 { "match": { "body":  "Brown fox" }}
             ]
         }
     }
 }
 
 POST blogs/_search
 {
     "query": {
         "dis_max": {
             "queries": [
                 { "match": { "title": "Quick pets" }},
                 { "match": { "body":  "Quick pets" }}
             ]
         }
     }
 }
 
 
 POST blogs/_search
 {
     "query": {
         "dis_max": {
             "queries": [
                 { "match": { "title": "Quick pets" }},
                 { "match": { "body":  "Quick pets" }}
             ],
             "tie_breaker": 0.2
         }
     }
 }
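 #tie_breaker is a weight coefficient applied to the other matching clauses (everything except the best match)
 #1. Take the _score of the best-matching clause; 2. Multiply the score of each other matching clause by tie_breaker; 3. Add these together to get the final score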
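
The three steps above can be written out as a small Python sketch and compared with a plain sum of should clauses. The per-field scores are assumed values for illustration, and the function names are hypothetical; the sketch ignores any coordination factor.

```python
import math


def bool_should_score(clause_scores):
    """Sum of the scores of the matching should clauses (0 means no match)."""
    return sum(score for score in clause_scores if score > 0)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    matched = [score for score in clause_scores if score > 0]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Assumed scores of the title clause and the body clause for one document.
scores = [0.6, 0.3]

summed = bool_should_score(scores)             # both clauses count fully
best_only = dis_max_score(scores)              # only the best clause counts
with_tie = dis_max_score(scores, tie_breaker=0.2)  # others count at 20%
print(summed, best_only, with_tie)
```

With tie_breaker set to 0 the result is the pure maximum, and with 1 it equals the sum, so the parameter moves the behaviour between dis_max and should.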

6 multi_match

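Like dis_max, multi_match covers cases that should handles poorly. Its main types are Best Fields, Most Fields, and Cross Fields.

Best Fields

For fields that compete with each other yet are related, such as title and body. The score comes from the best-matching field.

Most Fields

A common technique for English content: the main field uses the english analyzer, which stems words and can add synonyms so that more documents match. A sub-field holds the same text with the standard analyzer to provide more precise matching. The additional fields act as signals that raise the relevance of matching documents; the more fields match, the better.
The operator parameter is applied to each field separately, so it cannot require all terms across the fields as a whole; copy_to can work around this, at the cost of extra storage.

Cross Fields

For some entities, such as person names, addresses, or book information, the information has to be identified across several fields, and each field is only one part of the whole. The goal is to find as many of the query terms as possible in any of the listed fields.
Supports operator. One advantage over copy_to is that individual fields can be boosted at search time.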

 POST blogs/_search
 {
   "query": {
     "multi_match": {
       "type": "best_fields",
       "query": "Quick pets",
       "fields": ["title","body"],
       "tie_breaker": 0.2,
       "minimum_should_match": "20%"
     }
   }
 }
 #tie_breaker sets the weight of the non-best matching fields
 #minimum_should_match sets the minimum share of query terms that must match
 
 
 POST books/_search
 {
     "query": {
         "multi_match": {
             "query":  "Quick brown fox",
             "fields": "*_title"
         }
     }
 }
 #Field names may use wildcards
 
 
 POST books/_search
 {
     "query": {
         "multi_match": {
             "query":  "Quick brown fox",
             "fields": [ "*_title", "chapter_title^2" ]
         }
     }
 }
 #The ^2 suffix boosts chapter_title
 
 
 
 DELETE /titles
 PUT /titles
 {
   "settings": { "number_of_shards": 1 },
   "mappings": {
     "properties": {
       "title": {
         "type": "text",
         "analyzer": "english",
         "fields": {
           "std": {
             "type": "text",
             "analyzer": "standard"
           }
         }
       }
     }
   }
 }
 
 #Recreate the index with only the english analyzer to observe its drawback first
 DELETE /titles
 PUT /titles
 {
   "mappings": {
     "properties": {
       "title": {
         "type": "text",
         "analyzer": "english"
       }
     }
   }
 }
 
 POST titles/_bulk
 { "index": { "_id": 1 }}
 { "title": "My dog barks" }
 { "index": { "_id": 2 }}
 { "title": "I see a lot of barking dogs on the road " }
 
 
 GET titles/_search
 {
   "query": {
     "match": {
       "title": "barking dogs"
     }
   }
 }
 #The english analyzer stems words, which lowers precision: tense and other word-form information is lost
 
 DELETE /titles
 PUT /titles
 {
   "mappings": {
     "properties": {
       "title": {
         "type": "text",
         "analyzer": "english",
         "fields": {"std": {"type": "text","analyzer": "standard"}}
       }
     }
   }
 }
 
 POST titles/_bulk
 { "index": { "_id": 1 }}
 { "title": "My dog barks" }
 { "index": { "_id": 2 }}
 { "title": "I see a lot of barking dogs on the road " }
 
 GET /titles/_search
 {
    "query": {
         "multi_match": {
             "query":  "barking dogs",
             "type":   "most_fields",
             "fields": [ "title", "title.std" ]
         }
     }
 }
 #The broad-matching field title retrieves as many matching documents as possible; title.std pushes the more relevant documents to the top
 
 
 GET /titles/_search
 {
    "query": {
         "multi_match": {
             "query":  "barking dogs",
             "type":   "most_fields",
             "fields": [ "title^10", "title.std" ]
         }
     }
 }
 #Boosts the weight of the title field
 
 
 
 GET /titles/_search
 {
    "query": {
         "multi_match": {
             "query":  "barking dogs",
             "type":   "cross_fields",
             "operator":"and",
             "fields": [ "street", "city","country","postcode" ]
         }
     }
 }

7 function_score

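All of the scores so far come out of the query itself. For more flexible scoring, function_score can recompute the score of every matching document after the query has run and sort by the newly generated score.

Built-in scoring functions

Weight: sets a simple, non-normalized weight for each document
Field Value Factor: uses the value of a field to modify _score, for example taking "popularity" or "number of likes" into account
Random Score: gives each user a different, random ranking
Decay functions: score by distance from a reference value of a field; the closer the value, the higher the score
Script Score: a custom script gives full control over the logic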

 PUT /blogs/_doc/1
 {
   "title":   "About popularity",
   "content": "In this post we will talk about...",
   "votes":   0
 }
 
 PUT /blogs/_doc/2
 {
   "title":   "About popularity",
   "content": "In this post we will talk about...",
   "votes":   100
 }
 
 PUT /blogs/_doc/3
 {
   "title":   "About popularity",
   "content": "In this post we will talk about...",
   "votes":   1000000
 }
 
 
 POST /blogs/_search
 {
   "query": {
     "function_score": {
       "query": {
         "multi_match": {
           "query":    "popularity",
           "fields": [ "title", "content" ]
         }
       },
       "field_value_factor": {
         "field": "votes"
       }
     }
   }
 }
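 #Goal: place blogs with more votes nearer the top of the results, while the search score remains the main basis for ranking
 #New score = old score * votes (the value of the votes field)
 #With 0 votes the score drops to 0, and with a very large count the votes swamp the old score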
 
 POST /blogs/_search
 {
   "query": {
     "function_score": {
       "query": {
         "multi_match": {
           "query":    "popularity",
           "fields": [ "title", "content" ]
         }
       },
       "field_value_factor": {
         "field": "votes",
         "modifier": "log1p"
       }
     }
   }
 }
 #modifier applies a function to smooth the curve
 #log1p: new score = old score * log10(1 + votes)
 #Other supported values: none, log, log2p, ln, ln1p, ln2p, square, sqrt, reciprocal
 
 
 POST /blogs/_search
 {
   "query": {
     "function_score": {
       "query": {
         "multi_match": {
           "query":    "popularity",
           "fields": [ "title", "content" ]
         }
       },
       "field_value_factor": {
         "field": "votes",
         "modifier": "log1p" ,
         "factor": 0.1
       }
     }
   }
 }
 #factor scales the field value before the modifier is applied
 #New score = old score * log10(1 + factor * votes)
 
 POST /blogs/_search
 {
   "query": {
     "function_score": {
       "query": {
         "multi_match": {
           "query":    "popularity",
           "fields": [ "title", "content" ]
         }
       },
       "field_value_factor": {
         "field": "votes",
         "modifier": "log1p" ,
         "factor": 0.1
       },
       "boost_mode": "sum",
       "max_boost": 3
     }
   }
 }
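 #boost_mode controls how the query score and the function value are combined. multiply: their product (default); sum: their sum; min/max: the smaller/larger of the two; replace: the function value replaces the query score
 #max_boost caps the function value at a maximum before it is combined with the query score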
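
The arithmetic behind these four requests can be checked with a short Python sketch. It assumes a query score of 1.0 for every blog (the three documents have identical text), implements only a few modifiers and boost modes, and uses hypothetical helper names; it is meant to mirror the formulas in the comments, not the Elasticsearch source.

```python
import math

# Subset of the modifiers, using the base-10 logarithm for log1p.
MODIFIERS = {
    "none": lambda value: value,
    "log1p": lambda value: math.log10(value + 1),
    "sqrt": math.sqrt,
}


def field_value_factor(value, factor=1.0, modifier="none"):
    """Apply factor to the field value first, then the modifier."""
    return MODIFIERS[modifier](factor * value)


def function_score(query_score, function_value, boost_mode="multiply",
                   max_boost=math.inf):
    """Combine the query score with the function value capped by max_boost."""
    capped = min(function_value, max_boost)
    if boost_mode == "multiply":
        return query_score * capped
    if boost_mode == "sum":
        return query_score + capped
    if boost_mode == "replace":
        return capped
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


QUERY_SCORE = 1.0  # assumed identical for all three blogs
VOTES = [0, 100, 1000000]

# Request 1: raw votes multiply the score.
raw = [function_score(QUERY_SCORE, field_value_factor(v)) for v in VOTES]

# Request 3: log1p with factor 0.1 flattens the curve.
smoothed = [
    function_score(QUERY_SCORE, field_value_factor(v, 0.1, "log1p"))
    for v in VOTES
]

# Request 4: sum mode with the function value capped at 3.
summed = [
    function_score(QUERY_SCORE, field_value_factor(v, 0.1, "log1p"),
                   boost_mode="sum", max_boost=3)
    for v in VOTES
]
print(raw, smoothed, summed)
```

In multiply mode a document with 0 votes still scores 0 even with log1p, which is one reason to consider sum when the query score should always count.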
 
 POST /blogs/_search
 {
   "query": {
     "function_score": {
       "random_score": {
         "seed": 911119
       }
     }
   }
 }
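 #Use case: a site wants its ads to get more exposure
 #Requirement: each user sees a different random ranking, but the same user sees consistent results across visits (Consistently Random)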
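
The idea of a consistent random ranking can be illustrated in Python by hashing a seed together with a document id. This is only a sketch of the concept: the SHA-256 hashing scheme and the names `seeded_random_score` and `ranking_for` are assumptions made for the example and do not reflect how Elasticsearch computes random_score internally.

```python
import hashlib


def seeded_random_score(seed, doc_id):
    """Deterministic pseudo-random score in [0, 1) for a (seed, doc_id) pair."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64


def ranking_for(seed, doc_ids):
    """Order documents by their seeded score, highest first."""
    return sorted(doc_ids, key=lambda doc_id: seeded_random_score(seed, doc_id),
                  reverse=True)


DOC_IDS = ["1", "2", "3"]

# The same seed (for example derived from a user id) gives the same order.
first_visit = ranking_for(911119, DOC_IDS)
second_visit = ranking_for(911119, DOC_IDS)
print(first_visit, second_visit)
```

Using one seed per user keeps that user's ranking stable across visits, while different seeds generally produce different orders.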
