The EXPLAIN primer covered EXPLAIN's options and the basic plan fields; this article continues with the EXPLAIN ANALYZE fields that appear in more complex scenarios.
1. EXPLAIN ANALYZE Field Descriptions
Tables with different storage types may display different information; the examples below use a HORC table.
-- prepare
create table t1(a int, b int);
insert into t1 select 1,1 from generate_series(1,100000) x;
insert into t1 select x,x from generate_series(1,100000) x;
update t1 set a=a+1;
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..902.00 rows=200000 width=20) (actual time=426.413..1261.467 rows=200000 loops=1)
-> Hash Join (cost=0.00..888.62 rows=50000 width=20) (actual time=447.066..929.733 rows=125120 loops=1)
-- the first actual time value, 447.066, is the time to return the first row; the hash join must build the hash table first, so it exceeds the Hash operator's total time
-- the second actual time value, 929.733, is the operator's total time; the hash join itself ran for 929.733-447.066=482.667ms
Hash Cond: (t1.a = t1_1.a)
-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=0.00..434.32 rows=50000 width=8) (actual time=0.369..229.617 rows=125120 loops=1)
Hash Key: t1.a
-> Seq Scan on t1 (cost=0.00..432.32 rows=50000 width=8) (actual time=34.011..256.556 rows=99970 loops=1)
(seg3) InputStream Info: 163389 byte(s) (12.122 MB/s) in 12.854 ms with 20 read call(s).
(seg3) Decompress Info: byteCompress/byteDecompress 130623/156132 byte(s) (783.679 MB/s) in 0.190 ms with 14 decompress call(s).
(seg3) Cache Read 48946/48946 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
(seg3) Cache Read 16344/33118 in 4/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
(seg3) Cache Read 48574/48574 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
(seg3) Cache Read 32751/32751 in 6/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
(seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
(seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [24168, 43400, 0] ms with [50, 50, 0] calls.
-> Hash (cost=439.43..439.43 rows=6562 width=12) (actual time=446.241..446.244 rows=25120 loops=1)
-- Buckets is the number of buckets in the in-memory hash table; if memory runs out and the table spills, Batches is greater than 1 and the hash table is split by hash value into Batches files.
Buckets: 131072 Batches: 1 Memory Usage: 2148884kB
-> HashAggregate (cost=0.00..439.43 rows=6562 width=12) (actual time=381.532..418.410 rows=25120 loops=1)
Group Key: t1_1.a
-- different node types may emit different details; a HashAggregate reports information about hash-table construction and aggregation
-- hash table(s) is the number of hash tables the agg uses, normally 1 for a plain agg; grouping-set aggregates compute several groupings and therefore build several hash tables
-- e.g. group by grouping sets((a,b),(a))
-- -> HashAggregate (cost=751.00..1337.26 rows=46126 width=8) (actual time=80.245..145.887 rows=50002 loops=1)
-- Hash Key: a, b
-- Hash Key: a
-- (seg0) hash table(s): 2; chain length 3.5 avg, 17 max; using 25001 of 32768 buckets; total 0 expansions.
-- chain describes hash collisions: the longest collision chain here holds 19 entries, 3.5 on average.
-- if a spill occurs, the spill details are also recorded under the keywords spill partitions; disk usage
(seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25007 of 32768 buckets; total 2 expansions.
-> Redistribute Motion 4:4 (slice3; segments: 4) (cost=0.00..433.32 rows=50000 width=4) (actual time=34.910..240.886 rows=125120 loops=1)
Hash Key: t1_1.a
-- the plan references the same table more than once, so repeated occurrences get a suffix to distinguish them and map them back to the original SQL
-> Seq Scan on t1 t1_1 (cost=0.00..432.32 rows=50000 width=4) (actual time=33.740..224.617 rows=99970 loops=1)
-- seg3 means this segment returned the most rows for this node during this execution
-- InputStream Info is the HORC table's IO detail: 131051 bytes read in 40.314 ms across 18 read calls
(seg3) InputStream Info: 131051 byte(s) (3.100 MB/s) in 40.314 ms with 18 read call(s).
-- decompression took 0.147 ms: 98291 compressed bytes expanded to 123710 bytes over 12 decompress calls
(seg3) Decompress Info: byteCompress/byteDecompress 98291/123710 byte(s) (802.579 MB/s) in 0.147 ms with 12 decompress call(s).
-- per-file cache information; on a file-cache hit the data no longer needs to be read from HDFS
(seg3) Cache Read 32684/32684 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
(seg3) Cache Read 16774/33118 in 2/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
(seg3) Cache Read 32498/32498 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
(seg3) Cache Read 0/32751 in 0/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
-- whether only the HORC file's statistics were read; this SQL does not qualify for a statistics-only read
(seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
-- Merge-On-Read information, which can be used to judge whether amplification is occurring
-- the node read the 99970 rows written by the insert, merged in the 99970 rows changed by the update, and produced 99970 rows.
-- processing the insert data took 21.717 ms; the update data took 34.751 ms
(seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [21.717, 34.751, 0] ms with [50, 50, 0] calls.
Optimizer: GPORCA
-- cache hit ratio
Cache Locality Ratio: 0.78 (446.0 kB/574.5 kB)
-- time the optimizer took to generate the plan
Planning Time: 109.879 ms
-- per-slice memory usage; slice1 contains the hashjoin and hashagg operators, so its usage is highest
(slice0) Executor memory: 256K bytes.
* (slice1) Executor memory: 4178K bytes avg x 4 workers, 4184K bytes max (seg3). Work_mem: 4881K bytes max, 4881K bytes wanted.
(slice2) Executor memory: 95K bytes avg x 4 workers, 95K bytes max (seg0).
(slice3) Executor memory: 97K bytes avg x 4 workers, 97K bytes max (seg0).
Memory used: 262144kB
Memory wanted: 10360kB
-- executor elapsed time: 1.297 s (total) = 0.001 s (executor shutdown phase) + 1.297 s (executor run phase)
Execution Time: 1.297 = 0.001 + 1.297 s
(45 rows)
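Several of the numbers annotated in the plan above can be cross-checked with quick arithmetic. A minimal sketch (all values copied from the output; the variable names are mine):

```python
# Hash join self time: the second "actual time" value is the operator's
# total time, the first is time-to-first-row (which includes building
# the inner hash table).
to_first, total = 447.066, 929.733               # ms
print(round(total - to_first, 3))                # 482.667

# MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970]:
# i/u/d rows were read from insert/update/delete data; the two
# "merged into" counts sum to the rows the scan actually emitted.
rows_read = 99970 + 99970 + 0
rows_out = 0 + 99970
print(rows_read / rows_out)                      # 2.0 rows read per row out

# Cache Locality Ratio = bytes served from cache / total bytes requested.
print(round(446.0 / 574.5, 2))                   # 0.78
```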
The new executor no longer computes on one tuple row at a time; it processes a batch of rows per call, so some of the EXPLAIN ANALYZE output differs. The plan below walks through the differences.
set enable_data_cache to off;
set new_executor to on;
set new_executor_enable_read_statistics = on;
explain analyze select * from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..887.92 rows=200000 width=16) (actual time=280.487..280.487 rows=0 loops=1)
InputStream Info: 0 byte(s) (nan GB/s) in 0.000 ms with 0 read call(s).
-> Hash Join (cost=0.00..877.21 rows=50000 width=16) (actual time=0.000..267.962 rows=0 loops=1)
Hash Cond: ((t1.a)::bigint = (count()))
-> Seq Scan on t1 (cost=0.00..432.32 rows=50000 width=8) (actual time=22.335..41.221 rows=99860 loops=1)
-- seg1 returned the most data: 50 batches carrying 99860 rows; the largest batch held 2048 rows, the smallest 609, 1997.2 on average
(seg1) 50 TupleBatch of 99860 Tuple, (1997.2/609/2048)(avg/min/max) Tuple/TupleBatch, 783 KB
-- whether ctid was skipped; ctid is a system column, so unless it is explicitly requested in the output the scan can usually SkipCtid
-- explain (analyze, verbose) select t1.ctid,* from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
-- -> Seq Scan on public.t1 (cost=0.00..431.66 rows=25000 width=24) (actual time=24.148..65.186 rows=100030 loops=1)
-- Output: t1.a, t1.b, t1.ctid
-- (seg3) 52 TupleBatch of 100030 Tuple, (1923.65/265/2048)(avg/min/max) Tuple/TupleBatch, 2.29 MB
-- (seg3) InputStream Info: 97456 byte(s) (2.511 MB/s) in 37.016 ms with 16 read call(s).
(seg1) SkipCtid: on
(seg1) InputStream Info: 97510 byte(s) (4.152 MB/s) in 22.396 ms with 8 read call(s).
(seg1) Decompress Info: byteCompress/byteDecompress 64852/65198 byte(s) (654.502 MB/s) in 0.095 ms with 6 decompress call(s).
(seg1) ReadStatsOnly Scan 0 in 0 stripe(s).
(seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
(seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [50, 0, 0] calls.
-> Hash (cost=432.41..432.41 rows=1 width=8) (actual time=0.000..212.683 rows=0 loops=1)
-> Broadcast Motion 1:4 (slice2) (cost=0.00..432.41 rows=1 width=8) (actual time=209.089..209.132 rows=1 loops=1)
(seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
(seg0) InputStream Info: 65 byte(s) (1.722 MB/s) in 0.036 ms with 1 read call(s).
-- two-phase agg optimization: when there are few distinct values, aggregating early reduces the amount of data moved through the Motion.
-- the eager agg strategy can achieve something similar by rewriting the SQL
-> Finalize Aggregate (cost=0.00..432.41 rows=1 width=8) (actual time=0.433..0.434 rows=1 loops=1)
1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
-> Gather Motion 4:1 (slice3; segments: 4) (cost=0.00..432.41 rows=1 width=8) (actual time=0.083..0.289 rows=4 loops=1)
4 TupleBatch of 4 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 188 B
InputStream Info: 257 byte(s) (1.763 MB/s) in 0.139 ms with 4 read call(s).
-> Partial Aggregate (cost=0.00..432.41 rows=1 width=8) (actual time=24.553..24.556 rows=1 loops=1)
(seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
-> Seq Scan on t1 t1_1 (cost=0.00..432.32 rows=50000 width=1) (actual time=30.738..44.003 rows=99860 loops=1)
(seg1) 2 TupleBatch of 99860 Tuple, (49930/2048/50099)(avg/min/max) Tuple/TupleBatch, 0 B
(seg1) ReadStatsOnly: on
(seg1) SkipCtid: on
(seg1) InputStream Info: 32866 byte(s) (1.584 MB/s) in 19.789 ms with 4 read call(s).
(seg1) Decompress Info: byteCompress/byteDecompress 290/312 byte(s) (4.312 MB/s) in 0.069 ms with 4 decompress call(s).
-- the subquery only computes count(*) and touches no other columns, so reading just the statistics is enough
-- currently only the new executor supports this, and the parameter new_executor_enable_read_statistics must be set
(seg1) ReadStatsOnly Scan 2 in 2 stripe(s).
(seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
-- by this second read the HORC files have been compacted, so only insert data remains
(seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [2, 0, 0] calls.
Optimizer: GPORCA
NewExecutor: ON
Cache Locality Ratio: 0.00 (0.0 B/254.5 kB)
Planning Time: 129.084 ms
(slice0) Executor memory: 66K bytes.
(slice1) Executor memory: 106K bytes avg x 4 workers, 106K bytes max (seg0). Work_mem: 49K bytes max.
(slice2) Executor memory: 210K bytes (entry db).
(slice3) Executor memory: 103K bytes avg x 4 workers, 103K bytes max (seg0).
Memory used: 262144kB
Execution Time: 0.385 = 0.000 + 0.385 s
(47 rows)
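The TupleBatch line in the plan above can be sanity-checked the same way. A sketch using the seg1 Seq Scan figures (variable names are mine):

```python
import math

# (seg1) 50 TupleBatch of 99860 Tuple, (1997.2/609/2048)(avg/min/max)
batches, tuples = 50, 99860
print(tuples / batches)           # 1997.2 average tuples per batch

# The max batch size observed is 2048, so the scan needs at least
# ceil(tuples / 2048) batches; 50 observed batches is consistent.
print(math.ceil(tuples / 2048))   # 49
```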
2. Related GUC Parameters
2.1 gp_enable_explain_allstat
gp_enable_explain_allstat can be set to true or false and prints per-segment execution timing. If you suspect serious data skew, enable it to compare the workload on each segment, but with many segments it can make the plan output very long.
-- the output is verbose, so only the key lines are annotated
set gp_enable_explain_allstat = on;
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..879.82 rows=100000 width=20) (actual time=700.442..1040.044 rows=200000 loops=1)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg-1 rows_out 200000, start_offset_by 69 ms, 700 ms to_first, 1040 ms to_end, in 1 loop
-> Hash Join (cost=0.00..873.13 rows=25000 width=20) (actual time=701.297..920.036 rows=75169 loops=1)
Hash Cond: (t1.a = t1_1.a)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
-- stats for all 4 segments: the row counts show some imbalance, though not severe; seg1 has the most rows at 75169, seg3 the fewest at 24901
-- start_offset_by is the executor initialization time, usually short
-- to_first is the time to return the first row; to_end is the operator's total execution time
-- the hashjoin must wait for the Hash operator to finish building its table, so to_first is large
seg0 rows_out 49922, start_offset_by 71 ms, 692 ms to_first, 863 ms to_end, in 1 loop
seg1 rows_out 75169, start_offset_by 70 ms, 701 ms to_first, 920 ms to_end, in 1 loop
seg2 rows_out 50008, start_offset_by 70 ms, 702 ms to_first, 868 ms to_end, in 1 loop
seg3 rows_out 24901, start_offset_by 73 ms, 692 ms to_first, 776 ms to_end, in 1 loop
-> Seq Scan on public.t1 (cost=0.00..431.66 rows=25000 width=8) (actual time=43.003..132.725 rows=75169 loops=1)
...
-> Hash (cost=434.88..434.88 rows=1 width=12) (actual time=656.866..656.867 rows=100000 loops=1)
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg0 rows_out 100000, start_offset_by 72 ms, 657 ms to_first, 657 ms to_end, in 1 loop
seg1 rows_out 100000, start_offset_by 75 ms, 653 ms to_first, 653 ms to_end, in 1 loop
seg2 rows_out 100000, start_offset_by 71 ms, 669 ms to_first, 669 ms to_end, in 1 loop
seg3 rows_out 100000, start_offset_by 75 ms, 652 ms to_first, 652 ms to_end, in 1 loop
-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=0.00..434.88 rows=1 width=12) (actual time=369.708..566.214 rows=100000 loops=1)
...
-> Finalize HashAggregate (cost=0.00..434.88 rows=1 width=12) (actual time=371.146..409.245 rows=25119 loops=1)
Output: t1_1.a, count()
Group Key: t1_1.a
work_mem: 19617kB Segments: 4 Max: 4905kB (segment 0) Workfile: (0 spilling)
Work_mem wanted: 4905K bytes avg, 4905K bytes max (seg0) to lessen workfile I/O affecting 4 workers.
-- extra information is printed for every segment, not just for the segment that returned the most rows.
(seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25006 of 32768 buckets; total 14 expansions.
(seg2) hash table(s): 1; chain length 3.5 avg, 13 max; using 24973 of 32768 buckets; total 14 expansions.
(seg1) hash table(s): 1; chain length 3.5 avg, 14 max; using 24902 of 32768 buckets; total 14 expansions.
(seg3) hash table(s): 1; chain length 3.5 avg, 16 max; using 25119 of 32768 buckets; total 14 expansions.
allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
seg0 rows_out 25006, start_offset_by 102 ms, 368 ms to_first, 397 ms to_end, in 1 loop
seg2 rows_out 24973, start_offset_by 98 ms, 377 ms to_first, 410 ms to_end, in 1 loop
seg1 rows_out 24902, start_offset_by 102 ms, 370 ms to_first, 397 ms to_end, in 1 loop
seg3 rows_out 25119, start_offset_by 99 ms, 371 ms to_first, 409 ms to_end, in 1 loop
...
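A common use of the allstat output is quantifying skew. A sketch using the Hash Join rows_out values above (the skew metric here is one illustrative choice, not a product-defined formula):

```python
# rows_out per segment from the Hash Join's allstat lines above.
rows_out = {"seg0": 49922, "seg1": 75169, "seg2": 50008, "seg3": 24901}

avg = sum(rows_out.values()) / len(rows_out)
skew = max(rows_out.values()) / avg
print(avg, round(skew, 2))   # 50000.0 1.5 -> seg1 does ~1.5x the average work
```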
2.2 explain_memory_verbosity
explain_memory_verbosity can be set to SUPPRESS, SUMMARY, or DETAIL; it prints memory-usage detail.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..879.82 rows=100000 width=20) (actual time=767.486..1212.341 rows=200000 loops=1)
Executor Memory: 33kB Segments: 1 Max: 33kB (segment -1)
-> Hash Join (cost=0.00..873.13 rows=25000 width=20) (actual time=771.139..1001.536 rows=75169 loops=1)
Hash Cond: (t1.a = t1_1.a)
-- with the parameter set, each operator's memory usage and its maximum across segments are printed
Executor Memory: 23330kB Segments: 4 Max: 5833kB (segment 0)
...
Optimizer: GPORCA
Cache Locality Ratio: 0.00 (0.0 B/318.1 kB)
Settings: enable_data_cache = 'off'
Planning Time: 70.609 ms
-- with the parameter set, the Vmem field additionally shows each slice's overall memory allocation
-- memory is managed through memory contexts; the current context implementation is an allocset composed of multiple memory blocks, avoiding frequent allocation and freeing of small chunks
-- a memory context allocates its allocset ahead of time; Vmem records the size actually malloc'ed.
-- Executor memory is the memory the executor actually used for computation
(slice0) Executor memory: 75K bytes. Vmem reserved: 59392K bytes.
(slice1) Executor memory: 5843K bytes avg x 4 workers, 5843K bytes max (seg0). Vmem reserved: 21504K bytes avg x 4 workers, 21504K bytes max (seg0).
* (slice2) Executor memory: 2044K bytes avg x 4 workers, 2050K bytes max (seg3). Vmem reserved: 18432K bytes avg x 4 workers, 18432K bytes max (seg0). Work_mem: 4905K bytes max, 4905K bytes wanted.
* (slice3) Executor memory: 2135K bytes avg x 4 workers, 4071K bytes max (seg0). Vmem reserved: 18944K bytes avg x 4 workers, 22528K bytes max (seg0). Work_mem: 7721K bytes max, 7721K bytes wanted.
Memory used: 262144kB
Memory wanted: 23760kB
Execution Time: 1.201 = 0.000 + 1.202 s
(162 rows)
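As the comments above note, Vmem reserved (what was actually malloc'ed, including memory-context pre-allocation) is larger than Executor memory (what the executor computed with). A sketch of the slice1 gap from this plan (variable names are mine):

```python
executor_kb = 5843   # slice1 Executor memory, avg per worker
vmem_kb = 21504      # slice1 Vmem reserved, avg per worker

print(vmem_kb - executor_kb)             # 15661 KB reserved beyond executor use
print(round(vmem_kb / executor_kb, 1))   # 3.7
```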
[Copyright notice] This article is original content by a Modb (墨天轮) user; reposts must credit the source (Modb), the article link, and the author, otherwise the author and Modb reserve the right to pursue liability. To report suspected plagiarism or infringement on Modb, email contact@modb.pro with supporting evidence; confirmed content will be removed immediately.




