The EXPLAIN primer covered EXPLAIN's options and the basic plan fields; this article continues with the EXPLAIN ANALYZE fields that appear in more complex scenarios.
1. EXPLAIN ANALYZE Field Descriptions
Tables with different storage types may display different information; the examples below use a HORC table.
-- prepare
create table t1(a int, b int);
insert into t1 select 1,1 from generate_series(1,100000) x;
insert into t1 select x,x from generate_series(1,100000) x;
update t1 set a=a+1;
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..902.00 rows=200000 width=20) (actual time=426.413..1261.467 rows=200000 loops=1)
-> Hash Join (cost=0.00..888.62 rows=50000 width=20) (actual time=447.066..929.733 rows=125120 loops=1)
-- the first actual time value, 447.066, is the time to return the first row; the hash join must build the hash table first, so it exceeds the Hash operator's total time
-- the second actual time value, 929.733, is the operator's total time; the hash join itself ran for 929.733-447.066=482.667ms
Hash Cond: (t1.a = t1_1.a)
-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=0.00..434.32 rows=50000 width=8) (actual time=0.369..229.617 rows=125120 loops=1)
Hash Key: t1.a
-> Seq Scan on t1 (cost=0.00..432.32 rows=50000 width=8) (actual time=34.011..256.556 rows=99970 loops=1)
(seg3) InputStream Info: 163389 byte(s) (12.122 MB/s) in 12.854 ms with 20 read call(s).
(seg3) Decompress Info: byteCompress/byteDecompress 130623/156132 byte(s) (783.679 MB/s) in 0.190 ms with 14 decompress call(s).
(seg3) Cache Read 48946/48946 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
(seg3) Cache Read 16344/33118 in 4/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
(seg3) Cache Read 48574/48574 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
(seg3) Cache Read 32751/32751 in 6/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
(seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
(seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [24168, 43400, 0] ms with [50, 50, 0] calls.
-> Hash (cost=439.43..439.43 rows=6562 width=12) (actual time=446.241..446.244 rows=25120 loops=1)
-- Buckets is the number of buckets in the in-memory hash table; if memory runs out and the table spills, Batches is greater than 1 and the hash table is split by hash value into Batches files.
Buckets: 131072 Batches: 1 Memory Usage: 2148884kB
-> HashAggregate (cost=0.00..439.43 rows=6562 width=12) (actual time=381.532..418.410 rows=25120 loops=1)
Group Key: t1_1.a
-- different node types may emit different details; a HashAggregate reports information about hash-table construction and aggregation
-- hash table(s) is the number of hash tables the agg uses, normally 1 for a plain agg; grouping-set aggregates compute several groupings and therefore build several hash tables
-- e.g. group by grouping sets((a,b),(a))
-- -> HashAggregate (cost=751.00..1337.26 rows=46126 width=8) (actual time=80.245..145.887 rows=50002 loops=1)
-- Hash Key: a, b
-- Hash Key: a
-- (seg0) hash table(s): 2; chain length 3.5 avg, 17 max; using 25001 of 32768 buckets; total 0 expansions.
-- chain describes hash collisions: the longest collision chain here holds 19 entries, 3.5 on average.
-- if a spill occurs, the spill details are also recorded under the keywords spill partitions; disk usage
(seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25007 of 32768 buckets; total 2 expansions.
-> Redistribute Motion 4:4 (slice3; segments: 4) (cost=0.00..433.32 rows=50000 width=4) (actual time=34.910..240.886 rows=125120 loops=1)
Hash Key: t1_1.a
-- the plan references the same table more than once, so repeated occurrences get a suffix to distinguish them and map them back to the original SQL
-> Seq Scan on t1 t1_1 (cost=0.00..432.32 rows=50000 width=4) (actual time=33.740..224.617 rows=99970 loops=1)
-- seg3 means this segment returned the most rows for this node during this execution
-- InputStream Info is the HORC table's IO detail: 131051 bytes read in 40.314 ms across 18 read calls
(seg3) InputStream Info: 131051 byte(s) (3.100 MB/s) in 40.314 ms with 18 read call(s).
-- decompression took 0.147 ms: 98291 compressed bytes expanded to 123710 bytes over 12 decompress calls
(seg3) Decompress Info: byteCompress/byteDecompress 98291/123710 byte(s) (802.579 MB/s) in 0.147 ms with 12 decompress call(s).
-- per-file cache information; on a file-cache hit the data no longer needs to be read from HDFS
(seg3) Cache Read 32684/32684 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
(seg3) Cache Read 16774/33118 in 2/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
(seg3) Cache Read 32498/32498 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
(seg3) Cache Read 0/32751 in 0/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
-- whether only the HORC file's statistics were read; this SQL does not qualify for a statistics-only read
(seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
-- Merge-On-Read information, which can be used to judge whether amplification is occurring
-- the node read the 99970 rows written by the insert, merged in the 99970 rows changed by the update, and produced 99970 rows.
-- processing the insert data took 21.717 ms; the update data took 34.751 ms
(seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [21.717, 34.751, 0] ms with [50, 50, 0] calls.
Optimizer: GPORCA
-- cache hit ratio
Cache Locality Ratio: 0.78 (446.0 kB/574.5 kB)
-- time the optimizer took to generate the plan
Planning Time: 109.879 ms
-- per-slice memory usage; slice1 contains the hashjoin and hashagg operators, so its usage is highest
(slice0) Executor memory: 256K bytes.
* (slice1) Executor memory: 4178K bytes avg x 4 workers, 4184K bytes max (seg3). Work_mem: 4881K bytes max, 4881K bytes wanted.
(slice2) Executor memory: 95K bytes avg x 4 workers, 95K bytes max (seg0).
(slice3) Executor memory: 97K bytes avg x 4 workers, 97K bytes max (seg0).
Memory used: 262144kB
Memory wanted: 10360kB
-- executor elapsed time: 1.297 s (total) = 0.001 s (executor shutdown phase) + 1.297 s (executor run phase)
Execution Time: 1.297 = 0.001 + 1.297 s
(45 rows)
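Several of the numbers annotated in the plan above can be cross-checked with quick arithmetic. A minimal sketch (all values copied from the output; the variable names are mine):

```python
# Hash join self time: the second "actual time" value is the operator's
# total time, the first is time-to-first-row (which includes building
# the inner hash table).
to_first, total = 447.066, 929.733               # ms
print(round(total - to_first, 3))                # 482.667

# MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970]:
# i/u/d rows were read from insert/update/delete data; the two
# "merged into" counts sum to the rows the scan actually emitted.
rows_read = 99970 + 99970 + 0
rows_out = 0 + 99970
print(rows_read / rows_out)                      # 2.0 rows read per row out

# Cache Locality Ratio = bytes served from cache / total bytes requested.
print(round(446.0 / 574.5, 2))                   # 0.78
```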
The new executor no longer computes on one tuple row at a time; it processes a batch of rows per call, so some of the EXPLAIN ANALYZE output differs. The plan below walks through the differences.
set enable_data_cache to off;
set new_executor to on;
set new_executor_enable_read_statistics = on;
explain analyze select * from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..887.92 rows=200000 width=16) (actual time=280.487..280.487 rows=0 loops=1)
InputStream Info: 0 byte(s) (nan GB/s) in 0.000 ms with 0 read call(s).
-> Hash Join (cost=0.00..877.21 rows=50000 width=16) (actual time=0.000..267.962 rows=0 loops=1)
Hash Cond: ((t1.a)::bigint = (count()))
-> Seq Scan on t1 (cost=0.00..432.32 rows=50000 width=8) (actual time=22.335..41.221 rows=99860 loops=1)
-- seg1 returned the most data: 50 batches carrying 99860 rows; the largest batch held 2048 rows, the smallest 609, 1997.2 on average
(seg1) 50 TupleBatch of 99860 Tuple, (1997.2/609/2048)(avg/min/max) Tuple/TupleBatch, 783 KB
-- whether ctid was skipped; ctid is a system column, so unless it is explicitly requested in the output the scan can usually SkipCtid
-- explain (analyze, verbose) select t1.ctid,* from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
-- -> Seq Scan on public.t1 (cost=0.00..431.66 rows=25000 width=24) (actual time=24.148..65.186 rows=100030 loops=1)
-- Output: t1.a, t1.b, t1.ctid
-- (seg3) 52 TupleBatch of 100030 Tuple, (1923.65/265/2048)(avg/min/max) Tuple/TupleBatch, 2.29 MB
-- (seg3) InputStream Info: 97456 byte(s) (2.511 MB/s) in 37.016 ms with 16 read call(s).
(seg1) SkipCtid: on
(seg1) InputStream Info: 97510 byte(s) (4.152 MB/s) in 22.396 ms with 8 read call(s).
(seg1) Decompress Info: byteCompress/byteDecompress 64852/65198 byte(s) (654.502 MB/s) in 0.095 ms with 6 decompress call(s).
(seg1) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
(seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [50, 0, 0] calls.
-> Hash (cost=432.41..432.41 rows=1 width=8) (actual time=0.000..212.683 rows=0 loops=1)
-> Broadcast Motion 1:4 (slice2) (cost=0.00..432.41 rows=1 width=8) (actual time=209.089..209.132 rows=1 loops=1)
(seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
(seg0) InputStream Info: 65 byte(s) (1.722 MB/s) in 0.036 ms with 1 read call(s).
-- two-phase agg optimization: when there are few distinct values, aggregating early reduces the amount of data moved through the Motion.
-- the eager agg strategy can achieve something similar by rewriting the SQL
-> Finalize Aggregate (cost=0.00..432.41 rows=1 width=8) (actual time=0.433..0.434 rows=1 loops=1)
1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
-> Gather Motion 4:1 (slice3; segments: 4) (cost=0.00..432.41 rows=1 width=8) (actual time=0.083..0.289 rows=4 loops=1)
4 TupleBatch of 4 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 188 B
InputStream Info: 257 byte(s) (1.763 MB/s) in 0.139 ms with 4 read call(s).
-> Partial Aggregate (cost=0.00..432.41 rows=1 width=8) (actual time=24.553..24.556 rows=1 loops=1)
(seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
-> Seq Scan on t1 t1_1 (cost=0.00..432.32 rows=50000 width=1) (actual time=30.738..44.003 rows=99860 loops=1)
(seg1) 2 TupleBatch of 99860 Tuple, (49930/2048/50099)(avg/min/max) Tuple/TupleBatch, 0 B
(seg1) ReadStatsOnly: on
(seg1) SkipCtid: on
(seg1) InputStream Info: 32866 byte(s) (1.584 MB/s) in 19.789 ms with 4 read call(s).
(seg1) Decompress Info: byteCompress/byteDecompress 290/312 byte(s) (4.312 MB/s) in 0.069 ms with 4 decompress call(s).
-- the subquery only computes count(*) and touches no other columns, so reading just the statistics is enough
-- currently only the new executor supports this, and the parameter new_executor_enable_read_statistics must be set
(seg1) ReadStatsOnly Scan 2 in 2 stripe(s).
(seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
-- by this second read the HORC files have been compacted, so only insert data remains
(seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [2, 0, 0] calls.
Optimizer: GPORCA
NewExecutor: ON
Cache Locality Ratio: 0.00 (0.0 B/254.5 kB)
Planning Time: 129.084 ms
(slice0) Executor memory: 66K bytes.
(slice1) Executor memory: 106K bytes avg x 4 workers, 106K bytes max (seg0). Work_mem: 49K bytes max.
(slice2) Executor memory: 210K bytes (entry db).
(slice3) Executor memory: 103K bytes avg x 4 workers, 103K bytes max (seg0).
Memory used: 262144kB
Execution Time: 0.385 = 0.000 + 0.385 s
(47 rows)
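The TupleBatch line in the plan above can be sanity-checked the same way. A sketch using the seg1 Seq Scan figures (variable names are mine):

```python
import math

# (seg1) 50 TupleBatch of 99860 Tuple, (1997.2/609/2048)(avg/min/max)
batches, tuples = 50, 99860
print(tuples / batches)           # 1997.2 average tuples per batch

# The max batch size observed is 2048, so the scan needs at least
# ceil(tuples / 2048) batches; 50 observed batches is consistent.
print(math.ceil(tuples / 2048))   # 49
```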
2. Related GUC Parameters
2.1 gp_enable_explain_allstat
gp_enable_explain_allstat can be set to true or false and prints per-segment execution timing. If you suspect serious data skew, enable it to compare the workload on each segment, but with many segments it can make the plan output very long.
-- the output is verbose, so only the key lines are annotated
set gp_enable_explain_allstat = on;
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..879.82 rows=100000 width=20) (actual time=700.442..1040.044 rows=200000 loops=1)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg-1 rows_out 200000, start_offset_by 69 ms, 700 ms to_first, 1040 ms to_end, in 1 loop
-> Hash Join (cost=0.00..873.13 rows=25000 width=20) (actual time=701.297..920.036 rows=75169 loops=1)
Hash Cond: (t1.a = t1_1.a)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
-- stats for all 4 segments: the row counts show some imbalance, though not severe; seg1 has the most rows at 75169, seg3 the fewest at 24901
-- start_offset_by is the executor initialization time, usually short
-- to_first is the time to return the first row; to_end is the operator's total execution time
-- the hashjoin must wait for the Hash operator to finish building its table, so to_first is large
seg0 rows_out 49922, start_offset_by 71 ms, 692 ms to_first, 863 ms to_end, in 1 loop
seg1 rows_out 75169, start_offset_by 70 ms, 701 ms to_first, 920 ms to_end, in 1 loop
seg2 rows_out 50008, start_offset_by 70 ms, 702 ms to_first, 868 ms to_end, in 1 loop
seg3 rows_out 24901, start_offset_by 73 ms, 692 ms to_first, 776 ms to_end, in 1 loop
-> Seq Scan on public.t1 (cost=0.00..431.66 rows=25000 width=8) (actual time=43.003..132.725 rows=75169 loops=1)
...
-> Hash (cost=434.88..434.88 rows=1 width=12) (actual time=656.866..656.867 rows=100000 loops=1)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg0 rows_out 100000, start_offset_by 72 ms, 657 ms to_first, 657 ms to_end, in 1 loop
seg1 rows_out 100000, start_offset_by 75 ms, 653 ms to_first, 653 ms to_end, in 1 loop
seg2 rows_out 100000, start_offset_by 71 ms, 669 ms to_first, 669 ms to_end, in 1 loop
seg3 rows_out 100000, start_offset_by 75 ms, 652 ms to_first, 652 ms to_end, in 1 loop
-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=0.00..434.88 rows=1 width=12) (actual time=369.708..566.214 rows=100000 loops=1)
...
-> Finalize HashAggregate (cost=0.00..434.88 rows=1 width=12) (actual time=371.146..409.245 rows=25119 loops=1)
Output: t1_1.a, count()
Group Key: t1_1.a
work_mem: 19617kB Segments: 4 Max: 4905kB (segment 0) Workfile: (0 spilling)
Work_mem wanted: 4905K bytes avg, 4905K bytes max (seg0) to lessen workfile I/O affecting 4 workers.
-- extra information is printed for every segment, not just for the segment that returned the most rows.
(seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25006 of 32768 buckets; total 14 expansions.
(seg2) hash table(s): 1; chain length 3.5 avg, 13 max; using 24973 of 32768 buckets; total 14 expansions.
(seg1) hash table(s): 1; chain length 3.5 avg, 14 max; using 24902 of 32768 buckets; total 14 expansions.
(seg3) hash table(s): 1; chain length 3.5 avg, 16 max; using 25119 of 32768 buckets; total 14 expansions.
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg0 rows_out 25006, start_offset_by 102 ms, 368 ms to_first, 397 ms to_end, in 1 loop
seg2 rows_out 24973, start_offset_by 98 ms, 377 ms to_first, 410 ms to_end, in 1 loop
seg1 rows_out 24902, start_offset_by 102 ms, 370 ms to_first, 397 ms to_end, in 1 loop
seg3 rows_out 25119, start_offset_by 99 ms, 371 ms to_first, 409 ms to_end, in 1 loop
...
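A common use of the allstat output is quantifying skew. A sketch using the Hash Join rows_out values above (the skew metric here is one illustrative choice, not a product-defined formula):

```python
# rows_out per segment from the Hash Join's allstat lines above.
rows_out = {"seg0": 49922, "seg1": 75169, "seg2": 50008, "seg3": 24901}

avg = sum(rows_out.values()) / len(rows_out)
skew = max(rows_out.values()) / avg
print(avg, round(skew, 2))   # 50000.0 1.5 -> seg1 does ~1.5x the average work
```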
2.2 explain_memory_verbosity
explain_memory_verbosity can be set to SUPPRESS, SUMMARY, or DETAIL; it prints memory-usage detail.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..879.82 rows=100000 width=20) (actual time=767.486..1212.341 rows=200000 loops=1)
Executor Memory: 33kB Segments: 1 Max: 33kB (segment -1)
-> Hash Join (cost=0.00..873.13 rows=25000 width=20) (actual time=771.139..1001.536 rows=75169 loops=1)
Hash Cond: (t1.a = t1_1.a)
-- with the parameter set, each operator's memory usage and its maximum across segments are printed
Executor Memory: 23330kB Segments: 4 Max: 5833kB (segment 0)
...
Optimizer: GPORCA
Cache Locality Ratio: 0.00 (0.0 B/318.1 kB)
Settings: enable_data_cache = 'off'
Planning Time: 70.609 ms
-- with the parameter set, the Vmem field additionally shows each slice's overall memory allocation
-- memory is managed through memory contexts; the current context implementation is an allocset composed of multiple memory blocks, avoiding frequent allocation and freeing of small chunks
-- a memory context allocates its allocset ahead of time; Vmem records the size actually malloc'ed.
-- Executor memory is the memory the executor actually used for computation
(slice0) Executor memory: 75K bytes. Vmem reserved: 59392K bytes.
(slice1) Executor memory: 5843K bytes avg x 4 workers, 5843K bytes max (seg0). Vmem reserved: 21504K bytes avg x 4 workers, 21504K bytes max (seg0).
* (slice2) Executor memory: 2044K bytes avg x 4 workers, 2050K bytes max (seg3). Vmem reserved: 18432K bytes avg x 4 workers, 18432K bytes max (seg0). Work_mem: 4905K bytes max, 4905K bytes wanted.
* (slice3) Executor memory: 2135K bytes avg x 4 workers, 4071K bytes max (seg0). Vmem reserved: 18944K bytes avg x 4 workers, 22528K bytes max (seg0). Work_mem: 7721K bytes max, 7721K bytes wanted.
Memory used: 262144kB
Memory wanted: 23760kB
Execution Time: 1.201 = 0.000 + 1.202 s
(162 rows)
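As the comments above note, Vmem reserved (what was actually malloc'ed, including memory-context pre-allocation) is larger than Executor memory (what the executor computed with). A sketch of the slice1 gap from this plan (variable names are mine):

```python
executor_kb = 5843   # slice1 Executor memory, avg per worker
vmem_kb = 21504      # slice1 Vmem reserved, avg per worker

print(vmem_kb - executor_kb)             # 15661 KB reserved beyond executor use
print(round(vmem_kb / executor_kb, 1))   # 3.7
```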
[Copyright notice] This article is original content by a Modb (墨天轮) user; reposts must credit the source (Modb), the article link, and the author, otherwise the author and Modb reserve the right to pursue liability. To report suspected plagiarism or infringement on Modb, email contact@modb.pro with supporting evidence; confirmed content will be removed immediately.




