If ❤️ this article helps you, likes and follows are welcome — they are the biggest encouragement for me to keep writing about technology. More past articles are in my personal column[1].
Elasticsearch Analysis (Analyzers)
•Analysis — text analysis is the process of converting full text into a series of terms (term/token), also known as tokenization
•Analysis is carried out by an Analyzer
•You can use Elasticsearch's built-in analyzers, or define custom analyzers as needed
•Besides converting text into terms at index time, the query string must also be analyzed with the same analyzer at search time
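To check which analyzer a field will apply at both index and query time, the `_analyze` API can be pointed at a concrete index field. A minimal sketch — the index name `my_index` and field `title` are made up for illustration:

```
GET my_index/_analyze
{
  "field": "title",
  "text": "Quick brown foxes"
}
```

This runs the analyzer mapped to the `title` field of `my_index`, which is the same analyzer a match query on that field would use.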
Anatomy of an Analyzer
An analyzer is the component dedicated to text analysis. It consists of three parts:
•Character Filters (preprocess the raw text, e.g. strip HTML)
•Tokenizer (splits the text into terms according to rules)
•Token Filters (post-process the terms: lowercasing, removing stopwords, adding synonyms)
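The three stages above can be wired together into a custom analyzer in the index settings. A minimal sketch, where the index name `my_index` and analyzer name `my_custom_analyzer` are hypothetical: `html_strip` as the character filter, the `standard` tokenizer, then `lowercase` and `stop` token filters:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}
```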
Tokenizing with built-in Analyzers

Elasticsearch ships with the following analyzers:

•Simple Analyzer – splits on anything that is not a letter (symbols are filtered out), lowercases
•Stop Analyzer – lowercases and filters stopwords (the, a, is)
•Whitespace Analyzer – splits on whitespace, does not lowercase
•Keyword Analyzer – no tokenization; the input is emitted as a single term
•Pattern Analyzer – splits by regular expression, default \W+ (non-word characters)
•Language – analyzers for 30+ common languages
Comparing the output of different analyzers
standard analyzer (default)
```
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Result:

```
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    ......
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}
```
Stop Analyzer – lowercasing and stopword filtering
```
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Result:

```
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    ......
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
```
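The stop analyzer's stopword list is configurable per index. A sketch, assuming a hypothetical index name `my_stop_index`, that limits the stopwords to exactly the three words listed above:

```
PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "the", "a", "is" ]
        }
      }
    }
  }
}
```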
More analyzer examples
```
# simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

# icu_analyzer vs. standard on Chinese text
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}
```
Note that the icu_analyzer analyzer (and likewise the ik analyzer) is not bundled with Elasticsearch 7.8.0. You must install it yourself and restart Elasticsearch before it can be used:

```
./bin/elasticsearch-plugin install analysis-icu
```
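After restarting, you can confirm the plugin was loaded via the cat plugins API:

```
GET _cat/plugins?v
```

The response should list analysis-icu along with its version.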
Chinese word segmentation
ik
Supports custom dictionaries and hot-reloading of dictionary entries: https://gitee.com/mirrors/elasticsearch-analysis-ik?_from=gitee_search
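Once installed, the ik plugin registers two analyzers, ik_max_word (exhaustive, fine-grained segmentation) and ik_smart (coarse-grained). A quick check on the same sentence used earlier:

```
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "这个苹果不大好吃"
}
```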
THULAC
A Chinese word segmenter from the Tsinghua University Natural Language Processing and Computational Social Humanities Lab: https://gitee.com/puremilk/THULAC-Python?_from=gitee_search
Further reading
•https://www.elastic.co/guide/en/elasticsearch/reference/7.1/indices-analyze.html
•https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
References

[1] More past articles in my personal column: https://coderdao.github.io/




