openGauss Daily Practice, Day 20 | openGauss Full-Text Search

Original article by lxs_data, 2021-12-20


openGauss Full-Text Search

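Two data types support full-text search. The tsvector type represents a document in a form optimized for text search, and the tsquery type represents a text query.

Full-text search in openGauss is based on the match operator @@, which returns true when a tsvector (document) matches a tsquery (query).

The two operands, tsvector (document) and tsquery (query), can be written in either order.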

Official documentation: https://opengauss.org/zh/docs/2.1.0/docs/Developerguide/%E5%9F%BA%E6%9C%AC%E6%96%87%E6%9C%AC%E5%8C%B9%E9%85%8D.html
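
As a minimal sketch of what the two types hold, the statements below first cast literals directly and then build comparable values with to_tsvector and to_tsquery. The sample sentence is illustrative only, and the 'english' configuration is assumed to be available in your installation.

```sql
-- A direct cast stores the words as written (sorted and de-duplicated);
-- it does not lowercase, stem, or drop stop words.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

-- to_tsvector normalizes words into lexemes using the named configuration
-- and records the position of each lexeme in the document.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');

-- A tsquery combines lexemes with & (AND), | (OR) and ! (NOT).
SELECT 'fat & (rat | cat)'::tsquery;

-- to_tsquery normalizes the query terms the same way to_tsvector normalizes documents.
SELECT to_tsquery('english', 'Fat & !Rats');
```

Comparing the first two results side by side is a quick way to see why documents and queries should normally be normalized with the same configuration.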

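Text Search Configurations
Full-text search can do more than basic matching: it can skip indexing certain words (stop words), handle synonyms, and use sophisticated parsing, for example parsing that is not based on whitespace alone. These features are controlled by text search configurations. openGauss ships with predefined configurations for many languages, and you can create your own (the gsql \dF command lists all available configurations).
An appropriate configuration is selected during installation, and default_text_search_config is set accordingly in postgresql.conf. If the whole openGauss instance uses the same text search configuration, the value in postgresql.conf is sufficient. If different databases need different configurations, use ALTER DATABASE … SET to set one per database. Users can also set default_text_search_config in each session.
Every text search function that depends on a configuration takes an optional configuration argument that names the configuration explicitly. default_text_search_config is used only when this argument is omitted.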
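
The three scopes described above (instance default, per database, per session) and the explicit argument can be sketched as follows. This is an illustration under assumptions: the database name mydb is hypothetical, and 'pg_catalog.english' is assumed to be one of the predefined configurations in your installation.

```sql
-- List the available configurations (similar information to gsql's \dF).
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;

-- Inspect the configuration currently in effect for this session.
SHOW default_text_search_config;

-- Override it for the current session only.
SET default_text_search_config = 'pg_catalog.english';

-- Set a per-database default; "mydb" is a hypothetical database name.
ALTER DATABASE mydb SET default_text_search_config TO 'pg_catalog.english';

-- Name the configuration explicitly so the result does not depend on any default.
SELECT to_tsvector('english', 'The quick brown foxes') @@ to_tsquery('english', 'fox');
```

Passing the configuration explicitly is usually the safer choice in application queries, because the result then stays the same when a session or database default changes.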
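To make custom text search configurations easier to build, a configuration is assembled from simpler database objects. The openGauss text search facility provides four types of configuration-related database objects:
Text search parsers break documents into tokens and classify each token (for example, as a word or a number).
Text search dictionaries convert tokens to normalized form and discard stop words.
Text search templates provide the functions underlying dictionaries: a dictionary specifies a template and a set of parameters for that template.
Text search configurations select a parser and a set of dictionaries used to normalize the tokens produced by the parser.
Official documentation: https://opengauss.org/zh/docs/2.1.0/docs/Developerguide/%E5%88%86%E8%AF%8D%E5%99%A8.html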
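
The four object types can be inspected through the system catalogs, and ts_debug shows how a configuration applies its parser and dictionaries to a piece of text. This is a rough sketch: the sample sentence is made up, and the 'english' configuration is assumed to exist.

```sql
-- Each object type has its own catalog.
SELECT prsname  FROM pg_ts_parser;    -- parsers
SELECT dictname FROM pg_ts_dict;      -- dictionaries
SELECT tmplname FROM pg_ts_template;  -- templates

-- A configuration points at the parser it uses.
SELECT c.cfgname, p.prsname
FROM pg_ts_config c
JOIN pg_ts_parser p ON p.oid = c.cfgparser;

-- For every token: the token type assigned by the parser, the dictionaries
-- consulted for that type, and the lexemes produced (empty for stop words).
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'a fat cat sat on 2 mats');
```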

openGauss Full-Text Search Exercises

Run two basic text matches, one as tsvector @@ tsquery and one as tsquery @@ tsvector

omm=# SELECT 'a fat cat TOM '::tsvector @@ 'cat & TOM'::tsquery AS RESULT;
result
--------
t
(1 row)

omm=# SELECT 'fat & JON'::tsquery @@ 'a fat cat TOM'::tsvector AS RESULT;
result
--------
f
(1 row)
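
The second match returns f because the document contains no JON lexeme. Note also that both exercises use plain casts, which do not normalize their input, so the comparison is case-sensitive. A short sketch of the difference, assuming the 'english' configuration is available:

```sql
-- Casts keep lexemes exactly as written, so 'tom' does not match 'TOM' (returns f).
SELECT 'a fat cat TOM'::tsvector @@ 'cat & tom'::tsquery AS cast_match;

-- to_tsvector / to_tsquery lowercase and stem both sides first,
-- so 'cats' and 'tom' match 'cat' and 'TOM' (returns t).
SELECT to_tsvector('english', 'a fat cat TOM')
       @@ to_tsquery('english', 'cats & tom') AS normalized_match;
```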

Create a table with at least two columns of type text, and run a full-text search before creating any index

omm=# CREATE TABLE TXT(id int, body text, title text, last_mod_date date);
CREATE TABLE
omm=# INSERT INTO TXT VALUES(1, 'China, officially the People''s Republic of China(PRC), located in Asia, is the world''s most populous state.', 'China', '2010-1-1');
INSERT 0 1
omm=# INSERT INTO TXT VALUES(2, 'America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.', 'America', '2010-1-1');
INSERT 0 1
omm=# INSERT INTO TXT VALUES(3, 'England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.', 'England','2010-1-1');
INSERT 0 1

omm=# select * from TXT;
id | body
| title | last_mod_date
----+---------------------------------------------------------------------------------------
--------------------------------------------+---------+---------------
1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's
most populous state. | China | 2010-01-01
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunn
ell, Dan Peek, and Gerry Beckley. | America | 2010-01-01
3 | England is a country that is part of the United Kingdom. It shares land borders with S
cotland to the north and Wales to the west. | England | 2010-01-01
(3 rows)

omm=# SELECT id, body, title FROM TXT WHERE to_tsvector(body) @@ to_tsquery('China');
id | body
| title
----+---------------------------------------------------------------------------------------
----------------------+-------
1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's
most populous state. | China
(1 row)

omm=# SELECT id, body, title FROM TXT WHERE to_tsvector(body) @@ to_tsquery('America');
id | body
| title
----+---------------------------------------------------------------------------------------
----------------------------------+---------
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunn
ell, Dan Peek, and Gerry Beckley. | America
(1 row)
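
Beyond single-word searches, the same table can be queried with combined terms. The queries below are a sketch that assumes the TXT table and the three rows inserted above, and names the 'english' configuration explicitly so the results do not depend on default_text_search_config.

```sql
-- AND: both lexemes must appear in the body.
SELECT id, title FROM TXT
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'rock & band');

-- AND NOT: bodies that mention England but not America.
SELECT id, title FROM TXT
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'england & !america');

-- plainto_tsquery turns free text into an AND query without operator syntax.
SELECT id, title FROM TXT
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'United Kingdom');

-- ts_rank orders the matching rows by relevance to the query.
SELECT id, title,
       ts_rank(to_tsvector('english', body), to_tsquery('english', 'england')) AS rank
FROM TXT
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'england')
ORDER BY rank DESC;
```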

Create GIN Indexes

omm=# CREATE INDEX TXT_1 ON TXT USING gin(to_tsvector('english', body));
CREATE INDEX

omm=# CREATE INDEX TXT_2 ON TXT USING gin(to_tsvector('english', title || ' ' || body));
CREATE INDEX

omm=# \d+ TXT
Table "public.txt"
Column | Type | Modifiers | Storage | Stats target | Description
---------------+---------+-----------+----------+--------------+-------------
id | integer | | plain | |
body | text | | extended | |
title | text | | extended | |
last_mod_date | date | | plain | |
Indexes:
"txt_1" gin (to_tsvector('english'::regconfig, body)) TABLESPACE pg_default
"txt_2" gin (to_tsvector('english'::regconfig, (title || ' '::text) || body)) TABLESPACE pg_default
Has OIDs: no
Options: orientation=row, compression=no

omm=# SELECT id, body, title FROM TXT WHERE to_tsvector(body) @@ to_tsquery('America');
id | body
| title
----+---------------------------------------------------------------------------------------
----------------------------------+---------
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunn
ell, Dan Peek, and Gerry Beckley. | America
(1 row)
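
The query above calls the one-argument form to_tsvector(body), whose result depends on default_text_search_config. An expression index is generally usable only when the query contains the same expression as the index, including the configuration name, so a query meant to benefit from TXT_1 or TXT_2 should repeat the indexed expression. The sketch below assumes the TXT table and both indexes still exist. With only three rows the planner may prefer a sequential scan anyway, so the example discourages sequential scans for the session to make the plan easier to compare.

```sql
-- Assumption: discourage sequential scans so a tiny table can still show an index plan.
SET enable_seqscan = off;

-- Same expression as index TXT_1: to_tsvector('english', body).
EXPLAIN SELECT id, title FROM TXT
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'America');

-- Same expression as index TXT_2: to_tsvector('english', title || ' ' || body).
EXPLAIN SELECT id, title FROM TXT
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', 'America');

-- Restore the default planner behavior.
RESET enable_seqscan;
```

If the plan still shows a sequential scan, compare the WHERE expression character by character with the index definition reported by \d+ TXT.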

Clean Up

omm=# drop table txt;
DROP TABLE
omm=#

The daily check-in continues.
