
Parallel optimization of aggregate queries on PostgreSQL partitioned tables with many partitions

digoal 2020-03-13


Tags

PostgreSQL , partitioned table , many partitions , aggregate query , hint bit , CLogControlLock


Background

A partitioned table has at least one file per partition, so a table with thousands of partitions has thousands of files. Before version 11, a full-table aggregate (e.g. count) on such a table was slow, because every partition had to be scanned one by one.

With parallel query this has improved markedly: on PG 12, a table with 2048 partitions holding roughly 200 million rows can complete the aggregate in about 1 second.

Example

The parallelism of a partitioned table comes from the parallelism of its partitions, so when creating the tables we force parallel_workers=52 on every partition.

```
do language plpgsql $$
declare
begin
  drop table if exists p;
  create table p (id int, info text, crt_time timestamp) partition by range (crt_time);
  create table p2020 partition of p FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') partition by hash (id);
  create table p2021 partition of p FOR VALUES FROM ('2021-01-01') TO ('2022-01-01') partition by hash (id);
  for i in 0..1023 loop
    execute format('create table p2020_%s partition of p2020 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('create table p2021_%s partition of p2021 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('alter table p2020_%s set (parallel_workers=52)', i);
    execute format('alter table p2021_%s set (parallel_workers=52)', i);
  end loop;
end;
$$;
```
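As a quick sanity check (my addition, not in the original post), the catalogs confirm that all 2048 leaf partitions exist:

```
-- Count the hash leaf partitions created by the DO block above.
-- Expect 2048 (1024 under p2020 + 1024 under p2021).
select count(*) from pg_class where relname ~ '^p202[01]_[0-9]+$';
```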

Load test: writes

```
vi test.sql

\set id random(1,2000000000)
insert into p values (:id, random()::text, now());
```

```
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 52 -j 52 -T 600

progress: 14.0 s, 459890.7 tps, lat 0.113 ms stddev 0.132
progress: 15.0 s, 439633.9 tps, lat 0.118 ms stddev 0.209
progress: 16.0 s, 440804.9 tps, lat 0.118 ms stddev 0.233
progress: 17.0 s, 444148.2 tps, lat 0.117 ms stddev 0.148
progress: 18.0 s, 463751.1 tps, lat 0.112 ms stddev 0.094
progress: 19.0 s, 469972.4 tps, lat 0.110 ms stddev 0.078
......
```

Force parallelism to get the best aggregate performance across all partitions, as follows:

```
db1=# show max_worker_processes ;
 max_worker_processes
----------------------
 128
(1 row)
set max_parallel_workers =32;
set max_parallel_workers_per_gather =32;
set parallel_setup_cost =0;
set parallel_tuple_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set parallel_leader_participation=off;
```

One more parameter should be chosen based on the actual workload.

Case 1:

```
set enable_partitionwise_aggregate=off;

db1=# explain select count(*) from p;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=847577.04..847577.05 rows=1 width=8)
   ->  Gather  (cost=847576.95..847576.96 rows=32 width=8)
         Workers Planned: 32
         ->  Partial Aggregate  (cost=847576.95..847576.96 rows=1 width=8)
               ->  Parallel Append  (cost=0.00..840170.98 rows=2962388 width=0)
                     ->  Parallel Seq Scan on p2020_25  (cost=0.00..920.11 rows=3011 width=0)
                     ->  Parallel Seq Scan on p2020_4  (cost=0.00..905.94 rows=2994 width=0)
                     ->  Parallel Seq Scan on p2020_42  (cost=0.00..898.65 rows=2965 width=0)
                     ->  Parallel Seq Scan on p2020_49  (cost=0.00..891.59 rows=2859 width=0)
                     ->  Parallel Seq Scan on p2020_3  (cost=0.00..890.98 rows=2998 width=0)
                     ->  Parallel Seq Scan on p2020_13  (cost=0.00..890.25 rows=3025 width=0)
```

Multiple workers run at once, aggregating many partitions concurrently; each individual partition is scanned without intra-partition parallel aggregation, and the partial results are combined via Parallel Append.
Suitable when there are many small partitions (e.g. under 1 GB each).

Case 2:

```
set enable_partitionwise_aggregate=on;

db1=# explain select count(*) from p;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=832959.57..832959.58 rows=1 width=8)
   ->  Gather  (cost=874.54..832795.73 rows=65536 width=8)
         Workers Planned: 32
         ->  Parallel Append  (cost=874.54..832795.73 rows=2048 width=8)
               ->  Partial Aggregate  (cost=927.63..927.64 rows=1 width=8)
                     ->  Parallel Seq Scan on p2020_25  (cost=0.00..920.11 rows=3011 width=0)
               ->  Partial Aggregate  (cost=913.43..913.44 rows=1 width=8)
                     ->  Parallel Seq Scan on p2020_4  (cost=0.00..905.94 rows=2994 width=0)
               ->  Partial Aggregate  (cost=906.07..906.08 rows=1 width=8)
                     ->  Parallel Seq Scan on p2020_42  (cost=0.00..898.65 rows=2965 width=0)
               ->  Partial Aggregate  (cost=898.73..898.74 rows=1 width=8)
                     ->  Parallel Seq Scan on p2020_49  (cost=0.00..891.59 rows=2859 width=0)
               ->  Partial Aggregate  (cost=898.47..898.48 rows=1 width=8)
                     ->  Parallel Seq Scan on p2020_3  (cost=0.00..890.98 rows=2998 width=0)
```

One partition is aggregated at a time, with multiple workers computing each partition in parallel; the per-partition partial results are then combined in a multi-stage parallel aggregate.
Suitable when there are few partitions (say fewer than 16) but each one is large (e.g. over 10 GB).
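Which case applies depends on the partition count and per-partition sizes; a quick catalog query shows the distribution (a sketch using standard catalogs; the name pattern assumes the example tables above):

```
-- Largest leaf partitions: many small ones favor enable_partitionwise_aggregate = off,
-- a handful of big ones favor = on.
select relname, pg_size_pretty(pg_relation_size(oid)) as size
from pg_class
where relname ~ '^p202[01]_[0-9]+$'
order by pg_relation_size(oid) desc
limit 10;
```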

Whether to enable or disable enable_partitionwise_aggregate comes down to whichever is faster. As a rule of thumb, the compute time per scheduled unit of work should exceed the scheduling overhead.
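Since the best setting depends on the data layout, the simplest approach is to time both (a sketch, using the example table `p` from above):

```
set enable_partitionwise_aggregate = off;
explain analyze select count(*) from p;   -- one partial aggregate over a Parallel Append

set enable_partitionwise_aggregate = on;
explain analyze select count(*) from p;   -- one partial aggregate per partition
```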

```
db1=# select count(*) from p;
   count
-----------
 212758999
(1 row)

Time: 1098.609 ms (00:01.099)
```

Question 2

You will notice that a count query run immediately after the load finishes is very, very slow. Why?

```
db1=# select wait_event, wait_event_type from pg_stat_activity ;
     wait_event      | wait_event_type
---------------------+-----------------
 AutoVacuumMain      | Activity
 LogicalLauncherMain | Activity
 ExecuteGather       | IPC
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
                     |
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 SLRURead            | IO
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
 CLogControlLock     | LWLock
                     |
 BgWriterMain        | Activity
 CheckpointerMain    | Activity
 WalWriterMain       | Activity
(40 rows)
```

The session view shows the backends waiting on CLogControlLock. Why would a read-only query need this lock?

Because each PostgreSQL tuple header carries the status (committed or aborted) of the transaction that wrote it. But commit and abort complete immediately by writing the clog file (2 bits per transaction); they do not go back and update the header of every row that transaction wrote. So when does the tuple-header status get updated?

The first time a row is touched afterwards (by select, vacuum, analyze, autovacuum, update, etc.), the backend looks up the transaction status in the corresponding clog bits and then sets the transaction-status mask in the tuple header.
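To put the clog bookkeeping in perspective (my own back-of-envelope arithmetic, not from the original post): at 2 bits per transaction, one standard 8 KB clog page covers 32768 transactions, so the load test's ~212 million transactions fit in only a few thousand pages — yet every un-hinted row visit must consult them under CLogControlLock:

```python
# Clog (pg_xact) capacity: PostgreSQL records each transaction's status in 2 bits.
BLOCK_SIZE_BYTES = 8192   # default PostgreSQL block size
BITS_PER_XACT = 2

xacts_per_page = BLOCK_SIZE_BYTES * 8 // BITS_PER_XACT

# Pages needed for the ~212 million transactions written by the load test above:
total_xacts = 212_758_999
clog_pages = -(-total_xacts // xacts_per_page)  # ceiling division

print(xacts_per_page)  # 32768 transaction statuses per 8 KB clog page
print(clog_pages)      # 6493 pages, i.e. roughly 50 MB of clog
```

The data is tiny; the slowdown comes from lock contention on those hot pages, not from I/O volume.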

Moreover, when data-file block checksums are enabled or wal_log_hints=on, setting these tuple-header status masks also generates WAL, so even a plain query can produce WAL.
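Whether hint-bit updates are WAL-logged on a given cluster can be checked directly (a sketch):

```
show data_checksums;  -- fixed at initdb time; 'on' makes hint-bit updates WAL-logged
show wal_log_hints;   -- if 'on', hint-bit updates are WAL-logged even without checksums
```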

So why is the first query slow? Because it is busy updating the hint bits in every tuple header.

After that, queries are fast: there is no need to consult clog for the transaction status anymore; it can be read directly from the tuple-header mask.
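A common workaround (my addition, not from the post) is to pay the hint-bit cost once, right after the bulk load and before user queries arrive:

```
-- VACUUM reads every heap page and sets hint bits (among other cleanup),
-- so the first user-facing query no longer stalls on clog lookups.
vacuum analyze p;

-- A throwaway sequential scan achieves the hint-bit part as well:
-- select count(*) from p;
```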

Kernel optimization ideas for the first-query slowness caused by hint-bit updates

Kernel improvement: add a switch, settable at the table, session, or cluster level, controlling whether hint bits are updated at all. For example, for log or ledger-style tables that are written and soon deleted, there is no point updating hint bits; better to skip the update and look up clog at query time instead.

Kernel improvement: make CLogControlLock more concurrent so that it does not block, otherwise parallel execution cannot show its full power.
