梧桐数据库 (WuTong Database): EXPLAIN ANALYZE Explained

Original · 2024-09-05

The earlier EXPLAIN introduction article covered the EXPLAIN options and the basic fields of a plan; this article continues with the EXPLAIN ANALYZE fields that appear in more complex scenarios.
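For reference, the basic invocation is sketched below. Unlike plain EXPLAIN, ANALYZE actually executes the statement, so each plan node carries real timings and row counts alongside the optimizer's estimates (some_table is a placeholder, not from the examples below):

-- EXPLAIN alone shows only cost/row estimates; adding ANALYZE runs the
-- query and appends "actual time=... rows=... loops=..." to every node.
explain analyze select count(*) from some_table;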

1. EXPLAIN ANALYZE Field Descriptions

Tables with different storage types may display different information; the walkthrough below uses a HORC table as the example.

-- prepare
create table t1(a int, b int);
insert into t1 select 1,1 from generate_series(1,100000) x;
insert into t1 select x,x from generate_series(1,100000) x;
update t1 set a=a+1;

explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
                                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..902.00 rows=200000 width=20) (actual time=426.413..1261.467 rows=200000 loops=1)
   ->  Hash Join  (cost=0.00..888.62 rows=50000 width=20) (actual time=447.066..929.733 rows=125120 loops=1)
         -- In actual time, 447.066 is when the first row was returned; a hash join must wait for its hash table to be built, so this exceeds the Hash operator's total time
         -- 929.733 is the operator's total time; the Hash Join itself therefore ran for 929.733-447.066=482.667 ms
         Hash Cond: (t1.a = t1_1.a)
         ->  Redistribute Motion 4:4  (slice2; segments: 4)  (cost=0.00..434.32 rows=50000 width=8) (actual time=0.369..229.617 rows=125120 loops=1)
               Hash Key: t1.a
               ->  Seq Scan on t1  (cost=0.00..432.32 rows=50000 width=8) (actual time=34.011..256.556 rows=99970 loops=1)
                     (seg3) InputStream Info: 163389 byte(s) (12.122 MB/s) in 12.854 ms with 20 read call(s).
                     (seg3) Decompress Info: byteCompress/byteDecompress 130623/156132 byte(s) (783.679 MB/s) in 0.190 ms with 14 decompress call(s).
                     (seg3) Cache Read 48946/48946 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
                     (seg3) Cache Read 16344/33118 in 4/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
                     (seg3) Cache Read 48574/48574 in 4/4 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
                     (seg3) Cache Read 32751/32751 in 6/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
                     (seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
                     (seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
                     (seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [24168, 43400, 0] ms with [50, 50, 0] calls.
         ->  Hash  (cost=439.43..439.43 rows=6562 width=12) (actual time=446.241..446.244 rows=25120 loops=1)
               -- Buckets is the bucket count of the in-memory hash table; when memory pressure forces a spill, Batches exceeds 1 and the hash table is split by hash value into that many files.
               Buckets: 131072  Batches: 1  Memory Usage: 2148884kB
               ->  HashAggregate  (cost=0.00..439.43 rows=6562 width=12) (actual time=381.532..418.410 rows=25120 loops=1)
                     Group Key: t1_1.a
                     -- Different node types emit different extra information; HashAggregate reports details on hash-table construction and the aggregation itself
                     -- hash table(s) is the number of hash tables the aggregate used, normally 1 for a plain aggregate. AP functions compute several groupings and thus create multiple hash tables,
                     -- e.g. group by grouping sets((a,b),(a)):
                     --   ->  HashAggregate  (cost=751.00..1337.26 rows=46126 width=8) (actual time=80.245..145.887 rows=50002 loops=1)
                     --         Hash Key: a, b
                     --         Hash Key: a
                     --         (seg0) hash table(s): 2; chain length 3.5 avg, 17 max; using 25001 of 32768 buckets; total 0 expansions.
                     -- chain length describes hash collisions: the longest collision chain holds 19 entries, 3.5 on average.
                     -- If a spill occurs, spill details are recorded as well; the keywords are "spill partitions" and "disk usage"
                     (seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25007 of 32768 buckets; total 2 expansions.
               ->  Redistribute Motion 4:4  (slice3; segments: 4)  (cost=0.00..433.32 rows=50000 width=4) (actual time=34.910..240.886 rows=125120 loops=1)
                     Hash Key: t1_1.a
                     -- The plan contains the same table more than once, so repeated occurrences get a numeric suffix to tell them apart and map back to the original SQL
                     ->  Seq Scan on t1 t1_1  (cost=0.00..432.32 rows=50000 width=4) (actual time=33.740..224.617 rows=99970 loops=1)
                           -- (seg3) marks the segment that returned the most rows for this node during this execution
                           -- InputStream Info is the HORC table's I/O detail: 131051 bytes read in 40.314 ms across 18 read calls
                           (seg3) InputStream Info: 131051 byte(s) (3.100 MB/s) in 40.314 ms with 18 read call(s).
                           -- Decompression took 0.147 ms: 98291 bytes of compressed data were handled in 12 decompress calls
                           (seg3) Decompress Info: byteCompress/byteDecompress 98291/123710 byte(s) (802.579 MB/s) in 0.147 ms with 12 decompress call(s).
                           -- Per-file cache information; on a file-cache hit the data no longer has to be read from HDFS
                           (seg3) Cache Read 32684/32684 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_1-0-0_20240903100009608.orc
                           (seg3) Cache Read 16774/33118 in 2/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.a5119514-0b0e-460c-b1f2-d6c0e6f52eb4_20240903100009608.log.1_1-0-0
                           (seg3) Cache Read 32498/32498 in 3/3 call(s) for //oushudb/default_filce/16390/14142/1196996/d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_3-0-0_20240903100009608.orc
                           (seg3) Cache Read 0/32751 in 0/6 call(s) for //oushudb/default_filce/16390/14142/1196996/.d37e29f2-e6f9-4cb3-8cf5-724fc8b02921_20240903100009608.log.1_3-0-0
                           -- Whether only the HORC file statistics were read; this SQL does not qualify for a statistics-only read
                           (seg3) ReadStatsOnly Scan 0 in 0 stripe(s).
                           (seg3) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
                           -- Merge-On-Read information, useful for judging whether write amplification is present
                           -- The node read the 99970 rows written by insert, merged the 99970 rows changed by update, and produced 99970 rows.
                           -- Processing the insert data took 21.717 ms; the update data took 34.751 ms
                           (seg3) MergeRead Info [i,u,d]: [99970, 99970, 0] merged into [0, 99970] tuples in [21.717, 34.751, 0] ms with [50, 50, 0] calls.
 Optimizer: GPORCA
 -- Cache hit ratio
 Cache Locality Ratio: 0.78 (446.0 kB/574.5 kB)
 -- Time the optimizer spent generating the plan
 Planning Time: 109.879 ms
 -- Per-slice memory usage; slice1 contains the Hash Join and HashAggregate operators, hence its higher usage
 (slice0)    Executor memory: 256K bytes.
 * (slice1)  Executor memory: 4178K bytes avg x 4 workers, 4184K bytes max (seg3).  Work_mem: 4881K bytes max, 4881K bytes wanted.
 (slice2)    Executor memory: 95K bytes avg x 4 workers, 95K bytes max (seg0).
 (slice3)    Executor memory: 97K bytes avg x 4 workers, 97K bytes max (seg0).
 Memory used:  262144kB
 Memory wanted:  10360kB
 -- Executor time: 1.297 s total = 0.001 s (executor shutdown phase) + 1.297 s (executor run phase)
 Execution Time: 1.297 = 0.001 + 1.297 s
(45 rows)
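To actually see the spill fields mentioned above (Batches > 1 on Hash, and the "spill partitions" / "disk usage" keywords on HashAggregate), one option is to shrink the per-statement memory budget and re-run the join. This is a sketch, assuming the Greenplum-style statement_mem GUC is honored here:

-- Assumption: statement_mem works as in Greenplum-derived systems.
-- With a tiny budget the hash table no longer fits in memory, so Hash
-- reports Batches > 1 and HashAggregate records its spill details.
set statement_mem = '1MB';
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
reset statement_mem;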

The new executor no longer computes on individual tuples; it processes a batch of rows at a time, so some of the EXPLAIN ANALYZE output differs. The plan below walks through the differences.

set enable_data_cache to off;
set new_executor to on;
set new_executor_enable_read_statistics = on;

explain analyze select * from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
                                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..887.92 rows=200000 width=16) (actual time=280.487..280.487 rows=0 loops=1)
   InputStream Info: 0 byte(s) (nan GB/s) in 0.000 ms with 0 read call(s).
   ->  Hash Join  (cost=0.00..877.21 rows=50000 width=16) (actual time=0.000..267.962 rows=0 loops=1)
         Hash Cond: ((t1.a)::bigint = (count()))
         ->  Seq Scan on t1  (cost=0.00..432.32 rows=50000 width=8) (actual time=22.335..41.221 rows=99860 loops=1)
               -- seg1 returned the most data: 50 batches carrying 99860 rows; the largest batch held 2048 rows, the smallest 609, 1997.2 on average
               (seg1) 50 TupleBatch of 99860 Tuple, (1997.2/609/2048)(avg/min/max) Tuple/TupleBatch, 783 KB
               -- Whether ctid was skipped; ctid is a system column, and unless the query asks for it in the output the scan can usually skip it:
               --   explain (analyze, verbose) select t1.ctid,* from t1, (select count(*) a from t1) s1 where t1.a=s1.a;
               --   ->  Seq Scan on public.t1  (cost=0.00..431.66 rows=25000 width=24) (actual time=24.148..65.186 rows=100030 loops=1)
               --         Output: t1.a, t1.b, t1.ctid
               --         (seg3) 52 TupleBatch of 100030 Tuple, (1923.65/265/2048)(avg/min/max) Tuple/TupleBatch, 2.29 MB
               --         (seg3) InputStream Info: 97456 byte(s) (2.511 MB/s) in 37.016 ms with 16 read call(s).
               (seg1) SkipCtid: on
               (seg1) InputStream Info: 97510 byte(s) (4.152 MB/s) in 22.396 ms with 8 read call(s).
               (seg1) Decompress Info: byteCompress/byteDecompress 64852/65198 byte(s) (654.502 MB/s) in 0.095 ms with 6 decompress call(s).
               (seg1) ReadStatsOnly Scan 0 in 0 stripe(s).
               (seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
               (seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [50, 0, 0] calls.
         ->  Hash  (cost=432.41..432.41 rows=1 width=8) (actual time=0.000..212.683 rows=0 loops=1)
               ->  Broadcast Motion 1:4  (slice2)  (cost=0.00..432.41 rows=1 width=8) (actual time=209.089..209.132 rows=1 loops=1)
                     (seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
                     (seg0) InputStream Info: 65 byte(s) (1.722 MB/s) in 0.036 ms with 1 read call(s).
                     -- Two-phase aggregation: with few distinct values, aggregating early reduces the data volume moved through the Motion.
                     -- Rewriting the SQL with an eager-agg strategy can achieve a similar effect
                     ->  Finalize Aggregate  (cost=0.00..432.41 rows=1 width=8) (actual time=0.433..0.434 rows=1 loops=1)
                           1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
                           ->  Gather Motion 4:1  (slice3; segments: 4)  (cost=0.00..432.41 rows=1 width=8) (actual time=0.083..0.289 rows=4 loops=1)
                                 4 TupleBatch of 4 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 188 B
                                 InputStream Info: 257 byte(s) (1.763 MB/s) in 0.139 ms with 4 read call(s).
                                 ->  Partial Aggregate  (cost=0.00..432.41 rows=1 width=8) (actual time=24.553..24.556 rows=1 loops=1)
                                       (seg0) 1 TupleBatch of 1 Tuple, (1/1/1)(avg/min/max) Tuple/TupleBatch, 47 B
                                       ->  Seq Scan on t1 t1_1  (cost=0.00..432.32 rows=50000 width=1) (actual time=30.738..44.003 rows=99860 loops=1)
                                             (seg1) 2 TupleBatch of 99860 Tuple, (49930/2048/50099)(avg/min/max) Tuple/TupleBatch, 0 B
                                             (seg1) ReadStatsOnly: on
                                             (seg1) SkipCtid: on
                                             (seg1) InputStream Info: 32866 byte(s) (1.584 MB/s) in 19.789 ms with 4 read call(s).
                                             (seg1) Decompress Info: byteCompress/byteDecompress 290/312 byte(s) (4.312 MB/s) in 0.069 ms with 4 decompress call(s).
                                             -- The subquery computes only count(*) and touches no column data, so reading statistics alone is enough
                                             -- Currently only the new executor supports this, and the new_executor_enable_read_statistics parameter must be set
                                             (seg1) ReadStatsOnly Scan 2 in 2 stripe(s).
                                             (seg1) Predicate Info: Expect to scan 0 stripe(s). Filter out 0 stripe(s).
                                             -- By this second read the HORC files had already been merged, so only insert data remains
                                             (seg1) MergeRead Info [i,u,d]: [99860, 0, 0] merged into [99860, 0] tuples in [0, 0, 0] ms with [2, 0, 0] calls.
 Optimizer: GPORCA
 NewExecutor: ON
 Cache Locality Ratio: 0.00 (0.0 B/254.5 kB)
 Planning Time: 129.084 ms
 (slice0)    Executor memory: 66K bytes.
 (slice1)    Executor memory: 106K bytes avg x 4 workers, 106K bytes max (seg0).  Work_mem: 49K bytes max.
 (slice2)    Executor memory: 210K bytes (entry db).
 (slice3)    Executor memory: 103K bytes avg x 4 workers, 103K bytes max (seg0).
 Memory used:  262144kB
 Execution Time: 0.385 = 0.000 + 0.385 s
(47 rows)

2. Related GUC Parameters

2.1 gp_enable_explain_allstat

gp_enable_explain_allstat can be set to true or false and prints per-segment execution timing details. If you suspect severe data skew, set this parameter to analyze how much work each segment did; note, however, that with many segments the plan output can grow very long. A session-level example is sketched below, followed by annotated output.
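-- Enable per-segment execution statistics for this session only,
-- then turn it back off so regular plans stay compact.
set gp_enable_explain_allstat = true;
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
set gp_enable_explain_allstat = false;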

-- The full output is long; only the key information is annotated
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;
                                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..879.82 rows=100000 width=20) (actual time=700.442..1040.044 rows=200000 loops=1)
   allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
    seg-1 rows_out 200000, start_offset_by 69 ms, 700 ms to_first, 1040 ms to_end, in 1 loop
   ->  Hash Join  (cost=0.00..873.13 rows=25000 width=20) (actual time=701.297..920.036 rows=75169 loops=1)
         Hash Cond: (t1.a = t1_1.a)
         allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
         -- Stats for all 4 segments; the row counts show some imbalance, though nothing severe: seg1 has the most rows (75169), seg3 the fewest (24901)
         -- start_offset_by is the executor initialization time, normally short
         -- to_first is the time to return the first row; to_end is the operator's total execution time
         -- A hash join has to wait for the Hash operator to finish building its hash table, so to_first is relatively large
          seg0 rows_out 49922, start_offset_by 71 ms, 692 ms to_first, 863 ms to_end, in 1 loop
          seg1 rows_out 75169, start_offset_by 70 ms, 701 ms to_first, 920 ms to_end, in 1 loop
          seg2 rows_out 50008, start_offset_by 70 ms, 702 ms to_first, 868 ms to_end, in 1 loop
          seg3 rows_out 24901, start_offset_by 73 ms, 692 ms to_first, 776 ms to_end, in 1 loop
         ->  Seq Scan on public.t1  (cost=0.00..431.66 rows=25000 width=8) (actual time=43.003..132.725 rows=75169 loops=1)
               ...
         ->  Hash  (cost=434.88..434.88 rows=1 width=12) (actual time=656.866..656.867 rows=100000 loops=1)
               allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
                seg0 rows_out 100000, start_offset_by 72 ms, 657 ms to_first, 657 ms to_end, in 1 loop
                seg1 rows_out 100000, start_offset_by 75 ms, 653 ms to_first, 653 ms to_end, in 1 loop
                seg2 rows_out 100000, start_offset_by 71 ms, 669 ms to_first, 669 ms to_end, in 1 loop
                seg3 rows_out 100000, start_offset_by 75 ms, 652 ms to_first, 652 ms to_end, in 1 loop
               ->  Broadcast Motion 4:4  (slice2; segments: 4)  (cost=0.00..434.88 rows=1 width=12) (actual time=369.708..566.214 rows=100000 loops=1)
                     ...
                     ->  Finalize HashAggregate  (cost=0.00..434.88 rows=1 width=12) (actual time=371.146..409.245 rows=25119 loops=1)
                           Output: t1_1.a, count()
                           Group Key: t1_1.a
                           work_mem: 19617kB  Segments: 4  Max: 4905kB (segment 0)  Workfile: (0 spilling)
                           Work_mem wanted: 4905K bytes avg, 4905K bytes max (seg0) to lessen workfile I/O affecting 4 workers.
                           -- The extra information is printed for every segment, not just the one that returned the most rows.
                           (seg0) hash table(s): 1; chain length 3.5 avg, 19 max; using 25006 of 32768 buckets; total 14 expansions.
                           (seg2) hash table(s): 1; chain length 3.5 avg, 13 max; using 24973 of 32768 buckets; total 14 expansions.
                           (seg1) hash table(s): 1; chain length 3.5 avg, 14 max; using 24902 of 32768 buckets; total 14 expansions.
                           (seg3) hash table(s): 1; chain length 3.5 avg, 16 max; using 25119 of 32768 buckets; total 14 expansions.
                           allstat: seg n_tuples start_offset timing_to_first_row timing_to_last_row n_loops
                            seg0 rows_out 25006, start_offset_by 102 ms, 368 ms to_first, 397 ms to_end, in 1 loop
                            seg2 rows_out 24973, start_offset_by 98 ms, 377 ms to_first, 410 ms to_end, in 1 loop
                            seg1 rows_out 24902, start_offset_by 102 ms, 370 ms to_first, 397 ms to_end, in 1 loop
                            seg3 rows_out 25119, start_offset_by 99 ms, 371 ms to_first, 409 ms to_end, in 1 loop
 ...
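If allstat points to skew, the distribution can be confirmed at the data level. A quick check, assuming the gp_segment_id pseudo-column of Greenplum-derived systems is available in 梧桐数据库:

-- Count how many rows of t1 live on each segment; large differences
-- between segments indicate a skewed distribution key.
select gp_segment_id, count(*)
from t1
group by gp_segment_id
order by gp_segment_id;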

2.2 explain_memory_verbosity

explain_memory_verbosity can be set to SUPPRESS, SUMMARY, or DETAIL and controls how much memory-usage detail is printed.
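A minimal sketch of enabling it for one session (the value names follow the source; DETAIL produces the fullest breakdown), which yields output like the annotated plan below:

-- Print detailed per-operator and per-slice memory accounting.
set explain_memory_verbosity = 'detail';
explain analyze select * from t1, (select count(*),a from t1 group by a) s1 where t1.a=s1.a;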

                                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..879.82 rows=100000 width=20) (actual time=767.486..1212.341 rows=200000 loops=1)
   Executor Memory: 33kB  Segments: 1  Max: 33kB (segment -1)
   ->  Hash Join  (cost=0.00..873.13 rows=25000 width=20) (actual time=771.139..1001.536 rows=75169 loops=1)
         Hash Cond: (t1.a = t1_1.a)
         -- With the parameter set, each operator's memory usage and its maximum are printed
         Executor Memory: 23330kB  Segments: 4  Max: 5833kB (segment 0)
 ...
 Optimizer: GPORCA
 Cache Locality Ratio: 0.00 (0.0 B/318.1 kB)
 Settings: enable_data_cache = 'off'
 Planning Time: 70.609 ms
 -- The parameter additionally reports each slice's overall memory allocation via the Vmem field
 -- Memory is managed through memory contexts; the current context type is an allocset composed of multiple memory blocks, which avoids frequent allocation and release of small chunks
 -- Memory contexts pre-allocate their allocsets; Vmem records the size actually malloc'd.
 -- Executor memory is the memory the executor really used for computation
 (slice0)    Executor memory: 75K bytes.  Vmem reserved: 59392K bytes.
 (slice1)    Executor memory: 5843K bytes avg x 4 workers, 5843K bytes max (seg0).  Vmem reserved: 21504K bytes avg x 4 workers, 21504K bytes max (seg0).
 * (slice2)  Executor memory: 2044K bytes avg x 4 workers, 2050K bytes max (seg3).  Vmem reserved: 18432K bytes avg x 4 workers, 18432K bytes max (seg0).  Work_mem: 4905K bytes max, 4905K bytes wanted.
 * (slice3)  Executor memory: 2135K bytes avg x 4 workers, 4071K bytes max (seg0).  Vmem reserved: 18944K bytes avg x 4 workers, 22528K bytes max (seg0).  Work_mem: 7721K bytes max, 7721K bytes wanted.
 Memory used:  262144kB
 Memory wanted:  23760kB
 Execution Time: 1.201 = 0.000 + 1.202 ms
(162 rows)