Elasticsearch Analysis and Analyzers

锐玩道 2021-08-17

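If my article helped ❤️, likes and follows are welcome. They are the greatest encouragement for me to keep writing technical content. More past articles are in my personal column [1].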

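Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is carried out by an Analyzer.

You can use one of Elasticsearch's built-in analyzers or define a custom analyzer as needed. Text is not only converted into terms when data is written: at search time, the query string should normally be analyzed with the same analyzer, so that the query terms can match the terms stored in the index.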
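To see why the query should go through the same analysis as the indexed text, here is a minimal Python sketch. It is a toy inverted index, not how Elasticsearch is implemented; `analyze`, `build_index`, and `search` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-word characters and lowercase each token."""
    return [t.lower() for t in re.split(r"\W+", text) if t]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzer=None):
    """Return the ids of documents that contain every query term."""
    # Without an analyzer the raw query string is looked up as one term.
    terms = analyzer(query) if analyzer else [query]
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


docs = {1: "Quick brown-foxes", 2: "Lazy dogs"}
index = build_index(docs)

matched = search(index, "QUICK Foxes", analyzer=analyze)  # analyzed like the documents
missed = search(index, "QUICK Foxes")                     # raw string, never analyzed
print(matched, missed)
```

The analyzed query finds document 1 because both sides were reduced to the same lowercase terms; the raw query finds nothing because no term `QUICK Foxes` exists in the index.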


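Analyzer Components

An analyzer is the component dedicated to text analysis. It consists of three parts:

- Character Filters: process the raw text, for example by stripping HTML
- Tokenizer: splits the text into tokens according to rules
- Token Filters: post-process the tokens, for example lowercasing them, removing stopwords, and adding synonyms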
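The three stages can be sketched as a small Python pipeline. This is a simplified model under stated assumptions, not the Elasticsearch implementation: the regex-based `strip_html`, the `\W+` tokenizer, and the three-word stop list are stand-ins chosen for brevity, and every function name here is hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "is"}  # illustrative subset, not a real default list


def strip_html(text):
    """Character filter: remove HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split the text on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text) if t]


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick fox is <b>running</b></p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stop_words],
)
print(tokens)
```

The order matters: lowercasing runs before stop word removal, otherwise `The` would not match the lowercase entry `the` in the stop list.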

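Tokenizing Text with Analyzers

Built-in analyzers:

- Simple Analyzer: splits on non-letters (symbols are dropped) and lowercases
- Stop Analyzer: lowercases and filters out stop words (the, a, is)
- Whitespace Analyzer: splits on whitespace and does not lowercase
- Keyword Analyzer: does not tokenize; the input is emitted as-is as a single token
- Pattern Analyzer: splits by regular expression, default \W+ (runs of non-word characters)
- Language: analyzers for more than 30 common languages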
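The splitting rules in the list above can be approximated in a few lines of Python. These are rough stand-ins written for comparison, not the Lucene implementations: the letter test uses a regex approximation, and the function names are hypothetical. The stop analyzer is left out here because it is the simple analyzer plus a stop word filter.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def simple(text):
    """Keep runs of letters only, then lowercase (digits and symbols are dropped)."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def whitespace(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword(text):
    """Emit the whole input as a single token."""
    return [text]


def pattern(text, regex=r"\W+"):
    """Split on a regular expression, then lowercase (the pattern analyzer lowercases by default)."""
    return [t.lower() for t in re.split(regex, text) if t]


for name, fn in [("simple", simple), ("whitespace", whitespace),
                 ("keyword", keyword), ("pattern", pattern)]:
    print(f"{name:>10}: {fn(TEXT)}")
```

On this sentence the differences show up in three places: `simple` drops the leading `2`, `whitespace` keeps `brown-foxes` and `evening.` intact, and `pattern` keeps `2` but splits `brown-foxes` in two.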

Comparing the Output of Different Analyzers

Standard analyzer (the default)

    GET _analyze
    {
      "analyzer": "standard",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    =================== Result ===================
    {
      "tokens" : [
        {
          "token" : "2",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        ......
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "<ALPHANUM>",
          "position" : 12
        }
      ]
    }
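The `start_offset`, `end_offset`, and `position` fields in the response above can be reproduced with a short Python sketch. It assumes that matching runs of word characters is a close enough stand-in for the standard tokenizer on this particular sentence; the real tokenizer follows Unicode text segmentation rules and will differ on other input. `analyze_with_offsets` is a hypothetical helper name, and the `type` field is not modeled.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def analyze_with_offsets(text):
    """Return lowercase tokens with character offsets and ordinal positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index one past the last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens


for token in analyze_with_offsets(TEXT):
    print(token)
```

Offsets point back into the original string, so `TEXT[62:69]` is `evening`, while `position` counts tokens rather than characters.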

Stop analyzer: lowercases and filters out stop words

    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    =================== Result ===================
    {
      "tokens" : [
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        ......
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 11
        }
      ]
    }
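In the stop analyzer result above, `evening` has position 11 even though fewer than twelve tokens are returned. Removed stop words still consume a position, which leaves gaps in the numbering. The sketch below models that behavior in Python. The stop list is Lucene's default English set as I recall it, so treat it as an assumption and check it against your version; `stop_analyze` is a hypothetical name.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# Assumption: recalled from Lucene's default English stop set; verify before relying on it.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def stop_analyze(text):
    """Split on non-letters, lowercase, and drop stop words while keeping positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # the token is dropped, but its position is still used up
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in stop_analyze(TEXT):
    print(token)
```

Here `in` and `the` occupy positions 8 and 9, so `summer` lands on 10 and `evening` on 11. Keeping the gaps lets phrase queries know that two surviving tokens were not adjacent in the original text.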

More analyzer examples

    #simple
    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #stop
    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #whitespace
    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #keyword
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #pattern
    GET _analyze
    {
      "analyzer": "pattern",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #english
    GET _analyze
    {
      "analyzer": "english",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    #icu_analyzer
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "他说的确实在理"
    }

    #standard
    POST _analyze
    {
      "analyzer": "standard",
      "text": "他说的确实在理"
    }

    #icu_analyzer
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "这个苹果不大好吃"
    }


Note that icu_analyzer is not bundled with Elasticsearch 7.8.0, and neither is the ik analyzer. To use icu_analyzer, install the plugin yourself by running ./bin/elasticsearch-plugin install analysis-icu and then restart Elasticsearch.
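The Chinese sample sentence above is a useful test case because it can be segmented in more than one way. The standard analyzer sidesteps the question by emitting one token per Chinese character, while dictionary-based analyzers try to find words. The Python sketch below contrasts the two. Forward maximum matching is swapped in here as a simple, well-known stand-in; it is not the algorithm used by icu_analyzer or ik, and the dictionaries are made up for the example.

```python
SENTENCE = "他说的确实在理"


def standard_like(text):
    """One token per character, as the standard analyzer does for Chinese text."""
    return list(text)


def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(standard_like(SENTENCE))
print(forward_max_match(SENTENCE, {"的确", "确实", "实在", "在理"}))
print(forward_max_match(SENTENCE, {"确实", "在理"}))
```

With the first dictionary the greedy match yields 他 / 说 / 的确 / 实在 / 理, which misreads the sentence; with the second it yields 他 / 说 / 的 / 确实 / 在理. The result depends heavily on the dictionary, which is why support for custom dictionaries matters in a Chinese analyzer.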

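Chinese Tokenization

ik

Supports custom dictionaries and hot updates of the dictionary: https://gitee.com/mirrors/elasticsearch-analysis-ik?_from=gitee_search

THULAC

A Chinese word segmentation toolkit from the Natural Language Processing and Computational Social Science Lab at Tsinghua University: https://gitee.com/puremilk/THULAC-Python?_from=gitee_search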

Further Reading

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/indices-analyze.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

References

[1] More past articles are in my personal column: https://coderdao.github.io/


