Original article:
https://www.depesz.com/2025/03/11/waiting-for-postgresql-18-allow-parallel-create-index-for-gin-indexes/
On March 3, 2025, Tomas Vondra committed this patch:
Allow parallel CREATE INDEX for GIN indexes

Allow using parallel workers to build a GIN index, similarly to BTREE
and BRIN. For large tables this may result in significant speedup when
the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The TID lists do not overlap (for a given key), which
means the merge sort simply concatenates the two lists. The merged
entries are written into a shared tuplesort for the leader.

The leader needs to merge the sorted entries again, before writing them
into the index. But this way a significant part of the work happens in
the workers, and the leader is left with merging fewer large entries,
which is more efficient.

Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).

Original patch by me, with reviews and substantial improvements by
Matthias van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke
Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
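The divide-and-merge scheme the commit message describes can be sketched as a toy (this is an illustrative Ruby model, not PostgreSQL internals): each "worker" builds sorted key → TID-list entries for a disjoint slice of the table, and since the TID lists for a given key never overlap, the "leader" merges them by simple concatenation rather than a full merge sort.

```ruby
# Toy model of the parallel GIN build merge logic (not real PG code).
# Each "worker" scans its own slice of (tid, keys) rows and builds a
# locally sorted key -> [tids] map, like the per-worker tuplesort.
def worker_build(rows)
  entries = Hash.new { |h, k| h[k] = [] }
  rows.each do |tid, keys|
    keys.each { |key| entries[key] << tid }
  end
  entries.sort.to_h # sorted by key; TIDs already in scan order
end

# The "leader" merges the workers' entries; because the TID ranges are
# disjoint, each per-key list is merged by plain concatenation.
def leader_merge(worker_entries)
  merged = Hash.new { |h, k| h[k] = [] }
  worker_entries.each do |entries|
    entries.each { |key, tids| merged[key].concat(tids) }
  end
  merged.sort.to_h
end

# Tiny table: TID 1..4, each row holding some keys.
table = [[1, [10, 20]], [2, [20]], [3, [10, 30]], [4, [20, 30]]]
w1 = worker_build(table[0, 2]) # worker 1 gets TIDs 1..2
w2 = worker_build(table[2, 2]) # worker 2 gets TIDs 3..4
index = leader_merge([w1, w2])
# index => {10=>[1, 3], 20=>[1, 2, 4], 30=>[3, 4]}
```

The real patch does this with tuplesorts and handles many more details (buffering, memory limits), but the key property is the same: disjoint TID ranges make the leader's merge cheap.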
GIN indexes are great, but they are rather slow to create. So, in some cases, this change can be a real game changer.
Let's try to test it. I need some data that can be searched using GIN, so let's generate some arrays:
#!/usr/bin/env ruby
# frozen_string_literal: true

arr = (1..100_000).to_a

1_000_000.times do
  puts "{#{arr.sample(rand() * 100 + 50).join(',')}}"
end
This will output 1 million random arrays, each with a random number of elements (from 50 to 149), where each element is a number from 1 to 100,000. I saved this output to /tmp/test.arrays.txt; it is ~560MB.
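As a quick sanity check on the generator's output format, a line can be parsed back and tested against the stated invariants (a hypothetical helper, not part of the original post):

```ruby
# Parse one generated line like "{12,7,999}" back into integers and check
# the documented invariants: 50..149 elements, each between 1 and 100_000.
def valid_line?(line)
  values = line.strip.delete('{}').split(',').map!(&:to_i)
  (50..149).cover?(values.size) && values.all? { |v| (1..100_000).cover?(v) }
end

puts valid_line?("{#{(1..60).to_a.join(',')}}") # a 60-element line passes
```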
Now, I can create a table and load the data:
=$ create table test_gin (the_array int4[]);
=$ \copy test_gin from /tmp/test.arrays.txt
Afterwards, the table is 444MB, as shown by \dt+ test_gin.
So, I made sure that Pg will not use parallelization, and created the index:
=$ set max_parallel_maintenance_workers = 0;
SET
=$ create index qq on test_gin using gin (the_array);
CREATE INDEX
Time: 77111.482 ms (01:17.111)
=$ explain (analyze, buffers) select * from test_gin where the_array @> '{1,2}'::int4[];
                                                     QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on test_gin  (cost=51.31..150.05 rows=25 width=421) (actual time=0.093..0.094 rows=1.00 loops=1)
   Recheck Cond: (the_array @> '{1,2}'::integer[])
   Heap Blocks: exact=1
   Buffers: shared hit=4 read=6
   I/O Timings: shared read=0.017
   ->  Bitmap Index Scan on qq  (cost=0.00..51.30 rows=25 width=0) (actual time=0.082..0.082 rows=1.00 loops=1)
         Index Cond: (the_array @> '{1,2}'::integer[])
         Buffers: shared hit=4 read=5
         I/O Timings: shared read=0.013
 Planning:
   Buffers: shared hit=12 read=6 dirtied=4
   I/O Timings: shared read=0.550
 Planning Time: 0.764 ms
 Execution Time: 0.112 ms
(14 rows)
A sanity check, and the size: \di+ qq shows the index is 1099 MB.
Now, let's see if it can be parallelized:
=$ drop index qq;
DROP INDEX
=$ set max_parallel_maintenance_workers = 10;
SET
=$ create index qq on test_gin using gin (the_array);
CREATE INDEX
Time: 53602.822 ms (00:53.603)
While it was running, ps showed parallel workers being used:

USER       PID  %CPU %MEM    VSZ    RSS TTY  STAT START  TIME COMMAND
…
pgdba    361901 10.2  0.3 282400 209020 ?    Rs   12:41  2:41  \_ postgres: depesz depesz [local] CREATE INDEX
pgdba    391026 99.4  0.0 270036  60428 ?    Rs   13:07  0:29      \_ postgres: parallel worker for PID 361901
…
$ \di+ qq
                                      List of indexes
 Schema │ Name │ Type  │ Owner  │  Table   │ Persistence │ Access method │  Size  │ Description
────────┼──────┼───────┼────────┼──────────┼─────────────┼───────────────┼────────┼─────────────
 public │ qq   │ index │ depesz │ test_gin │ permanent   │ gin           │ 769 MB │ [null]
(1 row)
=$ explain (analyze, buffers) select * from test_gin where the_array @> '{1,2}'::int4[];
                                                     QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on test_gin  (cost=34.31..133.05 rows=25 width=421) (actual time=0.152..0.153 rows=1.00 loops=1)
   Recheck Cond: (the_array @> '{1,2}'::integer[])
   Heap Blocks: exact=1
   Buffers: shared hit=4 read=6
   I/O Timings: shared read=0.026
   ->  Bitmap Index Scan on qq  (cost=0.00..34.30 rows=25 width=0) (actual time=0.143..0.144 rows=1.00 loops=1)
         Index Cond: (the_array @> '{1,2}'::integer[])
         Buffers: shared hit=4 read=5
         I/O Timings: shared read=0.023
 Planning:
   Buffers: shared hit=5 read=2 dirtied=1
   I/O Timings: shared read=0.015
 Planning Time: 0.200 ms
 Execution Time: 0.176 ms
(14 rows)
Perhaps someone smarter will be able to tell me why the index built without parallelism is larger (1099 MB vs. 769 MB), but in any case, this looks great.
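For the record, the two CREATE INDEX timings reported above work out to roughly a 1.44× speedup on this (single) test; a quick back-of-the-envelope check:

```ruby
# Timings from the two CREATE INDEX runs above.
serial_ms   = 77_111.482 # max_parallel_maintenance_workers = 0
parallel_ms = 53_602.822 # max_parallel_maintenance_workers = 10
speedup = serial_ms / parallel_ms
printf("speedup: %.2fx\n", speedup) # prints "speedup: 1.44x"
```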
Thanks to all involved for making it happen.




