暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

SQL优化 -- 数据严重倾斜连接场景优化

原创 金仓数据库 2023-05-09
2959

一、场景

对于两个大表关联的场景,如果过滤条件的列值,存在高度倾斜,可以考虑根据反向滤值,进行过滤操作,减少连接的CPU时间。

二、数据准备

-- 状态表 tp01_state 记录 大表tp01 记录的多种状态
kingbase=# select count(*) from tp01;
count
----------
10000000
(1 行记录)

--只有一个高度倾斜的列值
kingbase=# select issuc,count(*) from tp01_state group by issuc order by issuc;

issuc   | count
-------+---------
N        | 100
Y         | 9999900
(2 行记录)

三、SQL优化

1、优化前

原始SQL

多数数据匹配issuc='Y' 条件,执行计划就是两个大表,进行hashjoin。
select * from tp01 where id in (select id from tp01_state where issuc = 'Y');
--或者
select * from tp01 where exists (select 1 from tp01_state where id = tp01.id and issuc = 'Y');

执行计划:

Hash Semi Join  (cost=338555.00..1033383.15 rows=10000000 width=241) (actual time=2398.867..5889.537 rows=9999900 loops=1)
  Hash Cond: (tp01.id = tp01_state.id)
  ->  Seq Scan on tp01  (cost=0.00..444828.12 rows=10000012 width=241) (actual time=0.005..611.596 rows=10000000 loops=1)
  ->  Hash  (cost=213555.00..213555.00 rows=10000000 width=4) (actual time=2384.857..2384.858 rows=9999900 loops=1)
        Buckets: 16777216  Batches: 1  Memory Usage: 482631kB
        ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=10000000 width=4) (actual time=0.011..775.853 rows=9999900 loops=1)
              Filter: (issuc = 'Y'::text)
              Rows Removed by Filter: 100
Planning Time: 0.186 ms
Execution Time: 6137.233 ms

问题解析:tp_state 数据千万级,从执行计划可以看出,在inner table创建哈希表这一环节,优化器就耗时1.6s (2384ms - 775ms),而Hash Semi Join 这环节也耗时了近 2.9s 

PS:这里关键是tp01_state作为hash table,其结果集过大,导致hash 与比较操作消耗大量时间。

2、优化1:使用NOT IN 代替 IN

因为只有少量数据,匹配issuc='Y'反向条件,使用not in 减少大表的过滤操作。
select * from tp01 where tp01.id not in (select id from tp01_state where issuc <> 'Y' or issuc is null );-- QUERY PLAN
Seq Scan on tp01 (cost=213555.00..683383.15 rows=5000006 width=241) (actual time=517.554..1629.795 rows=9999900 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 100 SubPlan 1 -> Seq Scan on tp01_state (cost=0.00..213555.00 rows=1 width=4) (actual time=271.143..517.503 rows=100 loops=1) Filter: (issuc <> 'Y'::text) Rows Removed by Filter: 9999900 Planning Time: 0.087 ms Execution Time: 1870.376 ms

修改后的SQL,虽然使用了filter 方式,但由于SubPlan 1 结果集很小,效率还是提高了不少。

3、优化2:使用not between 代替 <>

not between 操作可以使用索引,就可以减少子查询的执行时间。
select *
from tp01
where tp01.id not in (select id from tp01_state where issuc not between 'Y' and 'Y' or issuc is null);

-- QUERY PLAN
Seq Scan on tp01  (cost=17.35..469845.50 rows=5000006 width=241) (actual time=0.098..1109.085 rows=9999900 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 100
  SubPlan 1
    ->  Bitmap Heap Scan on tp01_state  (cost=13.33..17.34 rows=1 width=4) (actual time=0.035..0.045 rows=100 loops=1)
          Recheck Cond: ((issuc < 'Y'::text) OR (issuc > 'Y'::text) OR (issuc IS NULL))
          Heap Blocks: exact=2
          ->  BitmapOr  (cost=13.33..13.33 rows=1 width=0) (actual time=0.028..0.030 rows=0 loops=1)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.020..0.020 rows=100 loops=1)
                      Index Cond: (issuc < 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                      Index Cond: (issuc > 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                      Index Cond: (issuc IS NULL)
Planning Time: 0.109 ms
Execution Time: 1349.526 ms

这里使用了条件 issuc not between 'Y' and 'Y' or issuc is null 替代 issuc <> 'Y',条件复杂化了,但是可以使用索引了。

PS:优化器实际将条件转化为不等式。这里是因为 issuc = 'Y' 的数据占据了大部分,使用索引可以跳过对 'Y' 值的访问。

最后修改时间:2024-08-06 08:14:13
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论