[译]如何使用tsearch索引数据

原创 Maleah 2022-03-13

2167

原文地址：https://www.depesz.com/2022/03/01/how-to-index-data-with-tsearch/

原文作者： depesz

翻译：Maleah

最近有人在Slack上问我，是否应该在某个字段上构建tsvector数据类型并且创建索引，或者对表达式进行索引。

经过讨论，我想我会把我的想法重新格式化为博文，避免下次重新输入。

对于初学者，我不推荐使用tsvector列。他并没有真正起到作用。当然，你可以在它上面使用索引，但是你也可以在to_tsvector表达式上使用索引。

他就在那里，用完磁盘空间。

但是，在表达式上使用索引是复杂的。

让我们假设有一张表：

create table posts (
    id int8 generated always as identity primary key,
    title text,
    headline text,
    main_body text
);

可以像这样创建索引：

create index fts_index on posts using gin (
    to_tsvector( 'english'::regconfig, title || headline || main_body )
);

但这将会导致误报。例如，假如title使用use单词结尾，headline使用fully开始，即使没有usefully单词，tsvector也会包含。

调整：

create index fts_index on posts using gin (
    to_tsvector( 'english'::regconfig, title || ' ' || headline || ' ' || main_body )
);

结束。但是这样就可以吗？headline是null呢？与null连接的任何字符串都变为null。当然我们可以使用coalesce：

create index fts_index on posts using gin (
    to_tsvector( 'english'::regconfig, coalesce(title, '') || ' ' || coalesce(headline, '') || ' ' || coalesce(main_body, '') )
);

他会起作用。但是，问题在于它不是很好，由于使用索引，查询必须重复精确的表达式。

所以，每个查询将会重复to_tsvector( 'english'::regconfig, coalesce(title, '') || ' ' || coalesce(headline, '') || ' ' || coalesce(main_body, '') )。如果选择使用search weights之类的东西，想象一下chaging/typing查询的乐趣，表达式会变得越来越长。

将所有逻辑保留在触发器和更新字段中，然后使用该字段搜索看起来是明智的。但是或许有另一种方式呢？

答案是肯定的。我可以创建一个函数fts，函数将会接受一个参数，这个参数是posts表中的整行。基于这行将会返回tsvector：

create function fts( IN p_post posts )
    returns tsvector
    language sql
    immutable
    as $$
select to_tsvector( 'english'::regconfig, coalesce(p_post.title, '') || ' ' || coalesce(p_post.headline, '') || ' ' || coalesce(p_post.main_body, '') )
$$;

有了这个，我可以：

create index fts_index on posts using gin ( fts(posts) );

他将会起作用：

$ \d posts
                           Table "public.posts"
  Column   │  Type  │ Collation │ Nullable │           Default            
───────────┼────────┼───────────┼──────────┼──────────────────────────────
 id        │ bigint │           │ not null │ generated always as identity
 title     │ text   │           │          │ 
 headline  │ text   │           │          │ 
 main_body │ text   │           │          │ 
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
    "fts_index" gin (fts(posts.*))
 
$ explain select * from posts p where fts(p) @@ plainto_tsquery('depesz');
                                                                                                        QUERY PLAN                                                                                                        
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on posts p  (cost=63.00..17914.45 rows=5000 width=78)
   Recheck Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
   ->  Bitmap Index Scan on fts_index  (cost=0.00..61.75 rows=5000 width=0)
         Index Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
(4 rows)

请关注fts_index扫描。

还有一个很cool的事情。它相对晦涩，但是在这种情况下很有帮助。

通常情况下，如果在函数中传入单个参数，这个参数是输入到一张表中（技术上是任何复合类型，但是对于现在来说不重要），你可以在查询中使用它作为一个virtual字段。

这个语法已经记录在案，但是找到它并不容易。如果你想要更多的阅读，转到docs on function calls ，阅读Note，点击链接docs on “Using Composite Types in Queries"。

在这里你也可以看到他的使用：

$ explain select p.* from posts p where p.fts @@ plainto_tsquery('depesz');
                                                                                                        QUERY PLAN                                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on posts p  (cost=63.00..17911.17 rows=4999 width=469)
   Recheck Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
   ->  Bitmap Index Scan on fts_index  (cost=0.00..61.75 rows=4999 width=0)
         Index Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
(4 rows)

请关注我使用的语法：*p.fts @@* and not *fts(p) @@*。

你或许想使用这个功能，或者只是正常调用(fts(p))，不管哪种方式，在所有的查询中编写都是一个更简单的方式，可以在日常的查询中，轻松地“隐藏”构建tsvector的复杂性。

当然，如果你选择这种方式，在一段时间后，决定想要改变fts()函数的定义（例如，增加其他的字段，或者权重），你将需要在此基础上重建索引，因为索引所以不会意识到函数的定义已经被改变。

请注意，你可以有许多相同名字的函数，这些不同的函数基于不同的参数类型。这意味着如果你需要在更多的表上增加tsearch索引，你可以增加更多的不同参数类型的fts()函数，在任何地方重用fts（或者调用这些函数）。

某一天我们可能得到基于表达式的生成列，生成列不需要被保存，这将会是解决问题的更好的方式，但是现在，简单的函数或者基于函数的索引似乎（对我来说）是最干净的解决方法。

希望这篇文章将会帮助到一些人🙂。

2022-03-01

墨力计划 postgresql

最后修改时间：2022-03-13 10:35:25

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

文章被以下合辑收录

PostgreSQL 杂谈（共22篇）

日常学习的关于 PostgreSQL 数据库的零碎知识

墨力翻译计划优秀译文（共111篇）

分享墨力翻译计划优秀文章，欢迎加小墨（modb666）加入翻译大家庭。

[译]如何使用tsearch索引数据

文章被以下合辑收录

评论