PostgreSQL Optimizer: Parallel Query

ClickHouse周边 2023-02-14

1. Optimizer parameters related to parallel query

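      PostgreSQL uses the following parameters to decide whether to run a query in parallel and how many worker processes to start. They apply to PG 9.6 and later, although some of them were added, renamed, or given new defaults in later releases.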

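max_worker_processes
Sets the maximum number of background processes the system can support. It can only be set at server start. The default is 8. On a standby server, set this parameter to a value equal to or higher than the one on the primary; otherwise queries may not be allowed on the standby. Setting it to 0 disables parallelism.
max_parallel_workers_per_gather
Sets the maximum number of workers that a single Gather node can start. Parallel workers are taken from the pool of processes established by max_worker_processes, and their number is limited by max_parallel_workers (available from PG 10). The default is 2 (it was 0 in PG 9.6). Setting this value to 0 disables parallel query execution.
Note that the requested number of workers may not be available at run time. If that happens, the plan runs with fewer workers than expected, which may be less efficient. Also note that in an OLTP system this value should not be set too high, because every worker can consume the same amount of resources such as work_mem, and contention becomes severe. Parallel query is better suited to OLAP workloads, combined with job scheduling to reduce conflicts.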

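Example: in a WITH query, two subqueries are computed in parallel. Although max_parallel_workers_per_gather is set to 6, max_worker_processes is 8, so at most 8 workers exist in total. The first Gather node gets 6 worker processes, and the other Gather gets only the remaining 2.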

postgres=# show max_worker_processes ;  
 max_worker_processes   
----------------------  
 8  
(1 row)  


postgres=# set max_parallel_workers_per_gather=6;  
SET  
postgres=# explain (analyze,verbose,costs,timing,buffers) with t as (select count(*) from test), t1 as (select count(id) from test) select * from t,t1;  
                                                                            QUERY PLAN                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Nested Loop  (cost=159471.81..159471.86 rows=1 width=16) (actual time=7763.033..7763.036 rows=1 loops=1)  
   Output: t.count, t1.count  
   Buffers: shared hit=32940 read=74784  
   CTE t  
     ->  Finalize Aggregate  (cost=79735.90..79735.91 rows=1 width=8) (actual time=4714.114..4714.115 rows=1 loops=1)  
           Output: count(*)  
           Buffers: shared hit=16564 read=37456  
           ->  Gather  (cost=79735.27..79735.88 rows=6 width=8) (actual time=4714.016..4714.102 rows=7 loops=1)  
                 Output: (PARTIAL count(*))  
                 Workers Planned: 6  
                 Workers Launched: 6  
                 Buffers: shared hit=16564 read=37456  
                 ->  Partial Aggregate  (cost=78735.27..78735.28 rows=1 width=8) (actual time=4709.465..4709.466 rows=1 loops=7)  
                       Output: PARTIAL count(*)  
                       Buffers: shared hit=16084 read=37456  
                       Worker 0: actual time=4709.146..4709.146 rows=1 loops=1  
                         Buffers: shared hit=2167 read=5350  
                       Worker 1: actual time=4708.156..4708.156 rows=1 loops=1  
                         Buffers: shared hit=2140 read=5288  
                       Worker 2: actual time=4708.370..4708.370 rows=1 loops=1  
                         Buffers: shared hit=2165 read=4990  
                       Worker 3: actual time=4708.968..4708.969 rows=1 loops=1  
                         Buffers: shared hit=2501 read=5529  
                       Worker 4: actual time=4709.194..4709.195 rows=1 loops=1  
                         Buffers: shared hit=2469 read=5473  
                       Worker 5: actual time=4708.812..4708.813 rows=1 loops=1  
                         Buffers: shared hit=2155 read=5349  
                       ->  Parallel Seq Scan on public.test  (cost=0.00..73696.22 rows=2015622 width=0) (actual time=0.051..2384.380 rows=1728571 loops=7)  
                             Buffers: shared hit=16084 read=37456  
                             Worker 0: actual time=0.046..2385.108 rows=1698802 loops=1  
                               Buffers: shared hit=2167 read=5350  
                             Worker 1: actual time=0.057..2384.698 rows=1678728 loops=1  
                               Buffers: shared hit=2140 read=5288  
                             Worker 2: actual time=0.061..2384.109 rows=1617030 loops=1  
                               Buffers: shared hit=2165 read=4990  
                             Worker 3: actual time=0.046..2387.143 rows=1814780 loops=1  
                               Buffers: shared hit=2501 read=5529  
                             Worker 4: actual time=0.046..2382.491 rows=1794892 loops=1  
                               Buffers: shared hit=2469 read=5473  
                             Worker 5: actual time=0.070..2383.598 rows=1695904 loops=1  
                               Buffers: shared hit=2155 read=5349  
   CTE t1  
     ->  Finalize Aggregate  (cost=79735.90..79735.91 rows=1 width=8) (actual time=3048.902..3048.902 rows=1 loops=1)  
           Output: count(test_1.id)  
           Buffers: shared hit=16376 read=37328  
           ->  Gather  (cost=79735.27..79735.88 rows=6 width=8) (actual time=3048.732..3048.880 rows=3 loops=1)  
                 Output: (PARTIAL count(test_1.id))  
                 Workers Planned: 6  
                 Workers Launched: 2  
                 Buffers: shared hit=16376 read=37328  
                 ->  Partial Aggregate  (cost=78735.27..78735.28 rows=1 width=8) (actual time=3046.399..3046.400 rows=1 loops=3)  
                       Output: PARTIAL count(test_1.id)  
                       Buffers: shared hit=16212 read=37328  
                       Worker 0: actual time=3045.394..3045.395 rows=1 loops=1  
                         Buffers: shared hit=5352 read=12343  
                       Worker 1: actual time=3045.339..3045.340 rows=1 loops=1  
                         Buffers: shared hit=5354 read=12402  
                       ->  Parallel Seq Scan on public.test test_1  (cost=0.00..73696.22 rows=2015622 width=4) (actual time=0.189..1614.261 rows=4033333 loops=3)  
                             Output: test_1.id  
                             Buffers: shared hit=16212 read=37328  
                             Worker 0: actual time=0.039..1617.258 rows=3999030 loops=1  
                               Buffers: shared hit=5352 read=12343  
                             Worker 1: actual time=0.033..1610.934 rows=4012856 loops=1  
                               Buffers: shared hit=5354 read=12402  
   ->  CTE Scan on t  (cost=0.00..0.02 rows=1 width=8) (actual time=4714.120..4714.121 rows=1 loops=1)  
         Output: t.count  
         Buffers: shared hit=16564 read=37456  
   ->  CTE Scan on t1  (cost=0.00..0.02 rows=1 width=8) (actual time=3048.907..3048.908 rows=1 loops=1)  
         Output: t1.count  
         Buffers: shared hit=16376 read=37328  
 Planning time: 0.144 ms  
 Execution time: 7766.458 ms  
(72 rows)
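
The 6/2 split between the two Gather nodes can be reproduced with a toy model of the shared worker pool. This is only a sketch, assuming (as the explanation above does) that the workers launched by the first Gather still occupy their slots when the second Gather starts. `launch_workers` and the integer pool are hypothetical names for illustration, not PostgreSQL internals.

```python
def launch_workers(planned, free_slots):
    """Return (launched, remaining_free_slots) for one Gather node.

    A Gather node asks for `planned` workers but can only get as many
    as there are free slots left in the pool.
    """
    launched = min(planned, free_slots)
    return launched, free_slots - launched


# Values from the example: max_worker_processes = 8, and each Gather
# plans 6 workers (max_parallel_workers_per_gather = 6).
pool = 8
first, pool = launch_workers(6, pool)
second, pool = launch_workers(6, pool)
print(first, second)  # 6 2
```

The printed values correspond to the "Workers Launched" lines of the two Gather nodes in the plan.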

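parallel_setup_cost
The cost of starting worker processes. Starting a worker requires setting up shared memory and similar work, so it is an extra overhead. The default is 1000.
parallel_tuple_cost
Tuples produced by a worker process must be transferred to the node above it. This is the cost of exchanging rows between processes, that is, the estimated cost of passing one tuple from a parallel worker process to another process. The default is 0.1. It is multiplied by the estimated number of output rows of the node.
min_parallel_relation_size
The table size is also a condition for using parallel query: a table smaller than this value is not scanned in parallel. Note, however, that other conditions also affect the decision, so a table below this size is not guaranteed to be scanned serially. The default is 8MB. In PG 10 and later this parameter is replaced by min_parallel_table_scan_size and min_parallel_index_scan_size.
force_parallel_mode
Forces parallel mode. It is meant for testing, and can also be used like a hint. In PG 16 and later it is named debug_parallel_query.
More specifically, setting this value to on adds a Gather node to the top of any query plan that appears to be parallel safe, so that the query runs inside a parallel worker. Even when a parallel worker is not available or cannot be used, operations that would be prohibited in a parallel query context, such as starting a subtransaction, are prohibited unless the planner believes that this would cause the query to fail. If failures or unexpected results occur when this option is set, some functions used by the query may need to be marked PARALLEL UNSAFE (or possibly PARALLEL RESTRICTED).
The allowed values are off (use parallel mode only when it is expected to improve performance), on (force parallel query for every query that is thought to be safe), and regress (like on, but with additional behavior changes intended for regression testing).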
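max_parallel_workers
Sets the maximum number of workers the system can support for parallel operations. The default is 8. When increasing or decreasing this value, consider also adjusting max_parallel_maintenance_workers and max_parallel_workers_per_gather. Note that setting this value higher than max_worker_processes has no effect, because parallel workers are taken from the pool of worker processes established by max_worker_processes.

max_parallel_workers_per_gather
Sets the maximum number of workers that a single Gather or Gather Merge node can start. Parallel workers are taken from the pool of processes established by max_worker_processes, and their number is limited by max_parallel_workers. Note that the requested number of workers may not be available at run time. If that happens, the plan runs with fewer workers than expected, which may be less efficient. The default is 2 (0 in PG 9.6). Setting this value to 0 disables parallel query execution.

parallel_workers
      All of the parameters above are database-level parameters. parallel_workers is a table-level storage parameter; it can be set when the table is created or changed later.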

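-- Create the table
create table ... WITH( storage parameter ... )  


-- Set the table-level degree of parallelism
alter table test set (parallel_workers=32);  


-- Disable parallelism for the table
alter table test set (parallel_workers=0);  


-- Reset the parameter; create_plain_partial_paths then computes a reasonable degree of parallelism from the table's pages
alter table test reset (parallel_workers);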
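
To make the effect of the table-level setting concrete, here is a simplified model of how a per-scan worker count might be chosen. It is not a copy of create_plain_partial_paths: it assumes one worker once the table reaches a size threshold and one more each time the table is three times larger again, and it assumes a table-level parallel_workers overrides that estimate. The function name, the default `threshold_pages` of 1024 pages (8MB at 8kB per page) and the exact comparisons are assumptions, and the real logic differs between PostgreSQL versions.

```python
def estimate_workers(table_pages, max_per_gather,
                     table_parallel_workers=None, threshold_pages=1024):
    """Simplified model of choosing the worker count for one scan."""
    if table_parallel_workers is not None:
        # A table-level setting wins, capped by the per-Gather limit.
        return min(table_parallel_workers, max_per_gather)
    if table_pages < threshold_pages:
        return 0  # too small: no parallel scan in this model
    workers = 1
    threshold = threshold_pages
    # Add one worker each time the table is another 3x larger.
    while table_pages >= threshold * 3:
        workers += 1
        threshold *= 3
    return min(workers, max_per_gather)


print(estimate_workers(100000, 8))      # size-based estimate
print(estimate_workers(100000, 6, 32))  # table setting capped at 6
print(estimate_workers(100000, 6, 0))   # parallel_workers=0 disables
```

In this model, resetting parallel_workers corresponds to passing `table_parallel_workers=None`, which falls back to the size-based estimate.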

2. How the PG optimizer decides on parallelism

      Most of this was already covered in the parameter descriptions above; here is a summary.

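(1) Decide how many worker processes the whole system can start
    max_worker_processes
(2) Estimate the cost of parallel execution; the cost-based optimizer (CBO) then chooses whether to use parallelism
    parallel_setup_cost
    parallel_tuple_cost
    So for a simple query whose cost is already low (for example, lower than the startup cost of parallel execution), the database will clearly not use parallel query.
(3) The switch that forces parallelism
    force_parallel_mode
    When the cost computed in step 2 is higher than the non-parallel cost, this setting can be used to force the optimizer to use parallel query.
(4) Determine the degree of parallelism of each Gather node from the table-level parallel_workers parameter
    Take min(parallel_workers, max_parallel_workers_per_gather)
(5) When the table has no parallel_workers setting and is larger than min_parallel_relation_size, an algorithm determines the degree of parallelism of each Gather node
    Note that the number of workers a Gather can actually start also depends on how many worker processes are still available in the PG cluster as a whole.
    The number actually started can therefore be lower than what the optimizer computed, as the earlier example shows.
(6) Users can also use hints to control whether the optimizer forces parallelism; see the pg_hint_plan extension.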
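
The two cost parameters can be checked against the plan shown in section 1. The following is a minimal sketch, assuming the Gather cost is the child node's cost plus parallel_setup_cost, plus parallel_tuple_cost for each estimated output row. The function name is hypothetical, and the real planner takes more factors into account than this.

```python
def gather_cost(child_startup, child_total, rows,
                parallel_setup_cost=1000.0, parallel_tuple_cost=0.1):
    """Return (startup_cost, total_cost) of a Gather node in this model."""
    startup = child_startup + parallel_setup_cost
    total = child_total + parallel_setup_cost + parallel_tuple_cost * rows
    return startup, total


# Partial Aggregate in the plan: cost=78735.27..78735.28; Gather rows=6.
startup, total = gather_cost(78735.27, 78735.28, 6)
print(f"{startup:.2f}..{total:.2f}")  # 79735.27..79735.88
```

The result matches the Gather line in the plan (cost=79735.27..79735.88). It also shows why a query whose serial cost is below roughly 1000 is unlikely to be parallelized with the default settings.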

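3. How to set the parameters for a given number of parallel workers

Suppose a query should run with 32 parallel workers. How should the parameters be set?

A sufficiently large max_worker_processes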
```
max_worker_processes = 128
```
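A sufficiently large max_parallel_workers_per_gather (on PG 10 and later, max_parallel_workers, which defaults to 8, must also be at least 32)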
```
max_parallel_workers_per_gather = 32
```

Set the following to 0

parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_relation_size = 0

Force parallelism
```
force_parallel_mode = on
```
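The most easily overlooked parameter is the table-level degree of parallelism. If it is not set, the kernel computes one from the size of the table.

-- Set the table-level degree of parallelism
alter table test set (parallel_workers=32);


-- Disable parallelism for the table
alter table test set (parallel_workers=0);


-- Reset the parameter; create_plain_partial_paths then computes a reasonable degree of parallelism from the table's pages
alter table test reset (parallel_workers);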