一、场景
对于两个大表关联的场景,如果过滤条件的列值,存在高度倾斜,可以考虑根据反向滤值,进行过滤操作,减少连接的CPU时间。
二、数据准备
-- 状态表 tp01_state 记录 大表tp01 记录的多种状态
kingbase=# select count(*) from tp01;
count
----------
10000000
(1 行记录)
--只有一个高度倾斜的列值
kingbase=# select issuc,count(*) from tp01_state group by issuc order by issuc;
issuc | count
-------+---------
N | 100
Y | 9999900
(2 行记录)
三、SQL优化
1、优化前
原始SQL
多数数据匹配issuc='Y' 条件,执行计划就是两个大表,进行hashjoin。
select * from tp01 where id in (select id from tp01_state where issuc = 'Y');
--或者
select * from tp01 where exists (select 1 from tp01_state where id = tp01.id and issuc = 'Y');
执行计划:
Hash Semi Join (cost=338555.00..1033383.15 rows=10000000 width=241) (actual time=2398.867..5889.537 rows=9999900 loops=1)
Hash Cond: (tp01.id = tp01_state.id)
-> Seq Scan on tp01 (cost=0.00..444828.12 rows=10000012 width=241) (actual time=0.005..611.596 rows=10000000 loops=1)
-> Hash (cost=213555.00..213555.00 rows=10000000 width=4) (actual time=2384.857..2384.858 rows=9999900 loops=1)
Buckets: 16777216 Batches: 1 Memory Usage: 482631kB
-> Seq Scan on tp01_state (cost=0.00..213555.00 rows=10000000 width=4) (actual time=0.011..775.853 rows=9999900 loops=1)
Filter: (issuc = 'Y'::text)
Rows Removed by Filter: 100
Planning Time: 0.186 ms
Execution Time: 6137.233 ms问题解析:tp_state 数据千万级,从执行计划可以看出,在inner table创建哈希表这一环节,优化器就耗时1.6s (2384ms - 775ms),而Hash Semi Join 这环节也耗时了近 2.9s
PS:这里关键是tp01_state作为hash table,其结果集过大,导致hash 与比较操作消耗大量时间。
2、优化1:使用NOT IN 代替 IN
因为只有少量数据,匹配issuc='Y'反向条件,使用not in 减少大表的过滤操作。select * from tp01 where tp01.id not in (select id from tp01_state where issuc <> 'Y' or issuc is null );-- QUERY PLAN
Seq Scan on tp01 (cost=213555.00..683383.15 rows=5000006 width=241) (actual time=517.554..1629.795 rows=9999900 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 100
SubPlan 1
-> Seq Scan on tp01_state (cost=0.00..213555.00 rows=1 width=4) (actual time=271.143..517.503 rows=100 loops=1)
Filter: (issuc <> 'Y'::text)
Rows Removed by Filter: 9999900
Planning Time: 0.087 ms
Execution Time: 1870.376 ms修改后的SQL,虽然使用了filter 方式,但由于SubPlan 1 结果集很小,效率还是提高了不少。
3、优化2:使用not between 代替 <>
not between 操作可以使用索引,就可以减少子查询的执行时间。select *
from tp01
where tp01.id not in (select id from tp01_state where issuc not between 'Y' and 'Y' or issuc is null);
-- QUERY PLAN
Seq Scan on tp01 (cost=17.35..469845.50 rows=5000006 width=241) (actual time=0.098..1109.085 rows=9999900 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 100
SubPlan 1
-> Bitmap Heap Scan on tp01_state (cost=13.33..17.34 rows=1 width=4) (actual time=0.035..0.045 rows=100 loops=1)
Recheck Cond: ((issuc < 'Y'::text) OR (issuc > 'Y'::text) OR (issuc IS NULL))
Heap Blocks: exact=2
-> BitmapOr (cost=13.33..13.33 rows=1 width=0) (actual time=0.028..0.030 rows=0 loops=1)
-> Bitmap Index Scan on tp01_state_issuc (cost=0.00..4.44 rows=1 width=0) (actual time=0.020..0.020 rows=100 loops=1)
Index Cond: (issuc < 'Y'::text)
-> Bitmap Index Scan on tp01_state_issuc (cost=0.00..4.44 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
Index Cond: (issuc > 'Y'::text)
-> Bitmap Index Scan on tp01_state_issuc (cost=0.00..4.44 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: (issuc IS NULL)
Planning Time: 0.109 ms
Execution Time: 1349.526 ms这里使用了条件 issuc not between 'Y' and 'Y' or issuc is null 替代 issuc <> 'Y',条件复杂化了,但是可以使用索引了。
PS:优化器实际将条件转化为不等式。这里是因为 issuc = 'Y' 的数据占据了大部分,使用索引可以跳过对 'Y' 值的访问。




