一个不同的全局索引实现方法

开源软件联盟PostgreSQL分会 2022-11-04

528

概述

几年前，有人提议为 PostgreSQL 添加全局索引以支持分区表。在该提议之后，进行了多次讨论，并发布了一个初始版本的概念验证（POC），以证明其可行性、技术挑战和潜在优势等。然而，PostgreSQL 官方版本中仍然没有提供全局索引功能。本文尝试分享一种不同的方法，希望它能够被实现，以使一些用户受益。

全局索引基本需求

在 Postgres 分区表上使用全局索引有很多原因，如跨多个分区只读查询的性能优化、跨多个分区使用非分区键作为索引的唯一性等。举个例子，如果用户发现一个表增长得太快，当应用程序使用 key2 检索数据时，他们不得不根据 key1 将该表拆分为多个分区。在这种情况下，使用可用全局索引，他们就可以潜在地避免对应用程序进行不必要的更改。

不同的方式

为了解决这个全局索引的需求，PostgreSQL 社区讨论中的一个 POC 尝试将全局索引存储在单个索引关系中。这样，它肯定会带来更好的性能，因为所有元组索引关系都由一个索引关系文件组织。然而，有一个问题，它将遇到一个索引关系文件的物理大小限制，例如基于块编号（BlockNumber）的硬限制。另一个问题是，每次分离分区都需要清理这个唯一的全局索引关系，这有点违背了最初的分区表设计思想。正如前面提到的，分区表的设计思想之一是低成本地添加和删除分区。

另一种方法是，我们可以考虑将全局索引关系基于分区键来分开存储，添加逻辑操作允许全局访问这些分开存储的全局索引关系（具有非分区键的惟一性限制）。在这种方法中，我们保持了原始分区表设计的优点。一个是一个关系的大小限制，因为我们有分开存储的全局索引关系，另一个是很容易保持分离操作的性能。对于附加分区，它将取决于它是一个空表还是一个包含数据甚至索引的表。

该方法的主要思想是通过消除必须包含分区键的限制来使用现有功能，并添加逻辑来处理索引构建期间对非分区键和交叉分区排序的全局唯一性检查。

一些初步测试

基于第二种方法以及一些基本改变，下面分享一些简单的测试成果。

4.1. 设置分区

首先，使用 pgbench 创建 12 个分区并加载规模为 1000 的数据。
$ pgbench -i -s 1000 --partitions=12 --partition-method=range -d postgres

4.2. 设置全局索引

其次，创建不带分区键（aid）限制的全局索引，但对非分区键（bid）进行全局惟一性检查。

postgres=# create unique index gidx on pgbench_accounts using btree(bid) global;
Here is how the schema looks like after a global index has been created.

下面是创建全局索引后模式的样子。

postgres=# \d+ pgbench_accounts

                                   Partitioned table "public.pgbench_accounts"
  Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
 aid      | integer       |           | not null |         | plain    |             |              |
 bid      | integer       |           |          |         | plain    |             |              |
 abalance | integer       |           |          |         | plain    |             |              |
 filler   | character(84) |           |          |         | extended |             |              |
Partition key: RANGE (aid)
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "gidx" UNIQUE, btree (bid)
Partitions: pgbench_accounts_1 FOR VALUES FROM (MINVALUE) TO (8333335),
            pgbench_accounts_10 FOR VALUES FROM (75000007) TO (83333341),
            pgbench_accounts_11 FOR VALUES FROM (83333341) TO (91666675),
            pgbench_accounts_12 FOR VALUES FROM (91666675) TO (MAXVALUE),
            pgbench_accounts_2 FOR VALUES FROM (8333335) TO (16666669),
...

为了区分全局索引关系和正常索引关系，下面我们用 g 替换 i。

postgres=# select oid, relname, relnamespace, reltype, reloftype, relam, relfilenode, relpages, reltuples, relhasindex, relkind from pg_class where relnamespace=2200 order by oid;

  oid  |           relname           | relnamespace | reltype | reloftype | relam | relfilenode | relpages | reltuples | relhasindex | relkind
-------+-----------------------------+--------------+---------+-----------+-------+-------------+----------+-----------+-------------+---------
 16690 | gidx                        |         2200 |       0 |         0 |   403 |           0 |        0 |              0 | f           | I
 16691 | pgbench_accounts_1_bid_idx  |         2200 |       0 |         0 |   403 |       16691 |    22852 |   8.333334e+06 | f           | g
 16692 | pgbench_accounts_2_bid_idx  |         2200 |       0 |         0 |   403 |       16692 |    22852 |   8.333334e+06 | f           | g
 16693 | pgbench_accounts_3_bid_idx  |         2200 |       0 |         0 |   403 |       16693 |    22852 |   8.333334e+06 | f           | g
...

4.3. 使用全局索引的查询

现在，让我们使用非分区键（bid）运行一个简单的查询来对比性能

postgres=# select * from pgbench_accounts where bid=75000007;
   aid    |   bid    | abalance |                                        filler                                        
----------+----------+----------+--------------------------------------------------------------------------------------
 75000007 | 75000007 |        0 |                                                                                     
(1 row)

Time: 2.243 ms


postgres=# explain select * from pgbench_accounts where bid=75000007;
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Append  (cost=0.43..101.46 rows=12 width=97)
   ->  Index Scan using pgbench_accounts_1_bid_idx on pgbench_accounts_1  (cost=0.43..8.45 rows=1 width=97)
         Index Cond: (bid = 75000007)
   ->  Index Scan using pgbench_accounts_2_bid_idx on pgbench_accounts_2  (cost=0.43..8.45 rows=1 width=97)
         Index Cond: (bid = 75000007)
...

正如上面的执行计划所示，使用了索引扫描。

4.4. 不使用索引查询

然后，让我们删除这个全局索引，并再次运行相同的查询，

postgres=# drop index gidx;
DROP INDEX


postgres=# select * from pgbench_accounts where bid=75000007;

   aid    |   bid    | abalance |                                        filler                                        
----------+----------+----------+--------------------------------------------------------------------------------------
 75000007 | 75000007 |        0 |                                                                                     
(1 row)

Time: 8345.590 ms (00:08.346)


postgres=# explain select * from pgbench_accounts where bid=75000007;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..2161189.59 rows=12 width=97)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.00..2160188.39 rows=12 width=97)
         ->  Parallel Seq Scan on pgbench_accounts_1  (cost=0.00..180015.78 rows=1 width=97)
               Filter: (bid = 75000007)
         ->  Parallel Seq Scan on pgbench_accounts_2  (cost=0.00..180015.78 rows=1 width=97)
               Filter: (bid = 75000007)
...

在这里，我们可以看到有和没有全局索引的性能差异很大（2.243 ms vs. 8345.590 ms）

4.5. 使用具有分区键限制索引的查询

现在，让我们用分区键（aid）构建一个索引，

postgres=# create unique index lidx on pgbench_accounts using btree(aid, bid);
CREATE INDEX

postgres=# select * from pgbench_accounts where bid=75000007;
   aid    |   bid    | abalance |                                        filler                                        
----------+----------+----------+--------------------------------------------------------------------------------------
 75000007 | 75000007 |        0 |                                                                                     
(1 row)

Time: 3312.177 ms (00:03.312)

postgres=# explain select * from pgbench_accounts where bid=75000007;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.43..1846949.37 rows=12 width=97)
   ->  Index Scan using pgbench_accounts_1_aid_bid_idx on pgbench_accounts_1  (cost=0.43..153912.45 rows=1 width=97)
         Index Cond: (bid = 75000007)
   ->  Index Scan using pgbench_accounts_2_aid_bid_idx on pgbench_accounts_2  (cost=0.43..153912.45 rows=1 width=97)
         Index Cond: (bid = 75000007)

使用相同的查询，我们仍然可以看到全局索引和带分区键限制的原始索引之间的很大差异（2.243 ms vs. 3312.177 ms）。

4.6. 非分区键的唯一性

下面的例子试图说明，如果不对全局索引提供的非分区键进行惟一性检查，则重复的 bid 记录的插入无法被检测到。

postgres=# insert into pgbench_accounts values(100000001, 75000007, 0, '');
INSERT 0 1

postgres=# select * from pgbench_accounts where bid=75000007;
    aid    |   bid    | abalance |                                        filler                                   
-----------+----------+----------+--------------------------------------------------------------------------------------
  75000007 | 75000007 |        0 |                                                                                     
 100000001 | 75000007 |        0 |                                                                                     
(2 rows)

但是，通过全局索引提供的非分区键的唯一性检查，可以检测并阻止一个重复 bid 记录的插入。

postgres=# insert into pgbench_accounts values(100000001, 75000007, 0, ”);
ERROR: duplicate key value violates unique constraint “pgbench_accounts_10_bid_idx”
DETAIL: Key (bid)=(75000007) already exists.

总结

在这篇博客中，我解释了一种不同的方法来实现逻辑全局索引功能，但将物理存储分开，这可以潜在地保留 PostgreSQL 上原始的分区表设计理念，并演示了查询性能的优势和对非分区键的唯一性检查。

点击此处阅读原文

索引 postgresql 分区表

文章转载自开源软件联盟PostgreSQL分会，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

一个不同的全局索引实现方法

评论