等待 PostgreSQL 18 – 允许并行创建 GIN 索引

飞象数据 2天前

原文地址：

https://www.depesz.com/2025/03/11/waiting-for-postgresql-18-allow-parallel-create-index-for-gin-indexes/

2025 年 3 月 3 日，Tomas Vondra 提交了补丁：

Allow parallel CREATE INDEX for GIN indexes
允许并行为GIN索引创建索引


Allow using parallel workers to build a GIN index, similarly to BTREE
and BRIN. For large tables this may result in significant speedup when
the build is CPU-bound.
允许使用并行worker构建GIN索引，类似于BTREE和BRIN。对于大型表，当构建受到CPU限制时，这可能会导致显著的加速。


The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The TID lists do not overlap (for a given key), which
means the merge sort simply concatenates the two lists. The merged
entries are written into a shared tuplesort for the leader.
工作被划分，以便每个worker在表的一个子集上构建索引条目，该子集由用于读取数据的常规并行扫描确定。每个worker都使用本地元组排序来对同一个键的条目进行排序和合并。TID列表不重叠（对于给定的键），这意味着合并排序只是将两个列表连接起来。合并的条目被写入领导者的共享元组排序中。


The leader needs to merge the sorted entries again, before writing them
into the index. But this way a significant part of the work happens in
the workers, and the leader is left with merging fewer large entries,
which is more efficient.
领导者需要在将排序后的条目写入索引之前再次合并它们。但通过这种方式，很大一部分工作发生在工人身上，领导者只需合并更少的大条目，这会更有效率。


Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).
大多数并行基础设施是BTREE索引使用的代码的简化副本，省略了与GIN索引无关的部分（例如唯一性检查）。


Original patch by me, with reviews and substantial improvements by
Matthias van de Meent, certainly enough to make him a co-author.
我的原始补丁，Matthias van de Meent的评论和实质性改进，当然足以让他成为合著者。


Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke
Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com

GIN 索引很棒，但创建速度相当慢。因此，在某些情况下，这种变化有可能真正改变游戏规则。

让我们试着测试一下。我需要一些可以使用 GIN 搜索的数据，所以让我们编写一些数组：

#!/usr/bin/env ruby
# frozen_string_literal: true
arr = (1..100_000).to_a
1_000_000.times do
  puts "{#{arr.sample(rand() * 100 + 50).join(',')}}"
end

这将输出 100 万个随机数组，其中每个数组都有随机数量的元素（从 50 到 149），其中每个元素的数字从 1 到 100,000。将此输出保存到 tmp/test.arrays.txt，大小约为560MB。

现在，我可以创建一个表，并加载数据：

=$ create table test_gin (
    the_array int4[]
);
=$ \copy test_gin from /tmp/test.arrays.txt

之后，该表为444MB，如\dt+ test_gin
展示的那样。

因此，我确保pg不会使用并行处理，并创建了索引：

=$ set max_parallel_maintenance_workers = 0;
SET


=$ create index qq on test_gin using gin (the_array);
CREATE INDEX
Time: 77111.482 ms (01:17.111)

当然，我们应该看到索引现在用于查询：

=$ explain (analyze, buffers) select * from test_gin where the_array @> '{1,2}'::int4[];
                                                    QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on test_gin  (cost=51.31..150.05 rows=25 width=421) (actual time=0.093..0.094 rows=1.00 loops=1)
   Recheck Cond: (the_array @> '{1,2}'::integer[])
   Heap Blocks: exact=1
   Buffers: shared hit=4 read=6
   I/O Timings: shared read=0.017
   ->  Bitmap Index Scan on qq  (cost=0.00..51.30 rows=25 width=0) (actual time=0.082..0.082 rows=1.00 loops=1)
         Index Cond: (the_array @> '{1,2}'::integer[])
         Buffers: shared hit=4 read=5
         I/O Timings: shared read=0.013
 Planning:
   Buffers: shared hit=12 read=6 dirtied=4
   I/O Timings: shared read=0.550
 Planning Time: 0.764 ms
 Execution Time: 0.112 ms
(14 rows)

检查其健全性、大小：\di+ qq
显示的索引大小为 1099 MB。

现在，看是否可以并行处理：

=$ drop index qq;
DROP INDEX


=$ set max_parallel_maintenance_workers = 10;
SET


=$ create index qq on test_gin using gin (the_array);
CREATE INDEX
Time: 53602.822 ms (00:53.603)

在索引创建过程中，ps 输出显示：

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
…
pgdba     361901 10.2  0.3 282400 209020 ?       Rs   12:41   2:41      \_ postgres: depesz depesz [local] CREATE INDEX
pgdba     391026 99.4  0.0 270036 60428 ?        Rs   13:07   0:29      \_ postgres: parallel worker for PID 361901
…

这表明只有 2 个并行工作程序（主进程和一个 extra）。时间减少了大约 30%，这还不错。奇怪的是：

$ \di+ qq
                                        List of indexes
 Schema │ Name │ Type  │ Owner  │  Table   │ Persistence │ Access method │  Size  │ Description
────────┼──────┼───────┼────────┼──────────┼─────────────┼───────────────┼────────┼─────────────
 public │ qq   │ index │ depesz │ test_gin │ permanent   │ gin           │ 769 MB │ [null]
(1 row)

索引更小了？！不知道这是怎么发生的，这让我有点焦虑，但我确实验证了索引工作正常：

=$ explain (analyze, buffers) select * from test_gin where the_array @> '{1,2}'::int4[];
                                                    QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on test_gin  (cost=34.31..133.05 rows=25 width=421) (actual time=0.152..0.153 rows=1.00 loops=1)
   Recheck Cond: (the_array @> '{1,2}'::integer[])
   Heap Blocks: exact=1
   Buffers: shared hit=4 read=6
   I/O Timings: shared read=0.026
   ->  Bitmap Index Scan on qq  (cost=0.00..34.30 rows=25 width=0) (actual time=0.143..0.144 rows=1.00 loops=1)
         Index Cond: (the_array @> '{1,2}'::integer[])
         Buffers: shared hit=4 read=5
         I/O Timings: shared read=0.023
 Planning:
   Buffers: shared hit=5 read=2 dirtied=1
   I/O Timings: shared read=0.015
 Planning Time: 0.200 ms
 Execution Time: 0.176 ms
(14 rows)

也许更聪明的人能够告诉我为什么没有并行度的索引更大，但无论如何 - 它看起来很棒。

感谢所有让这一切成为可能的人。

数据库飞象数据库 postgresql

文章转载自飞象数据，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。