SQL优化 -- 数据严重倾斜连接场景优化

原创金仓数据库 2023-05-09

2959

一、场景

对于两个大表关联的场景，如果过滤条件的列值，存在高度倾斜，可以考虑根据反向滤值，进行过滤操作，减少连接的CPU时间。

二、数据准备

-- 状态表 tp01_state 记录大表tp01 记录的多种状态
kingbase=# select count(*) from tp01;
count
----------
10000000
(1 行记录)

--只有一个高度倾斜的列值
kingbase=# select issuc,count(*) from tp01_state group by issuc order by issuc;

issuc | count
-------+---------
N | 100
Y | 9999900
(2 行记录)

三、SQL优化

1、优化前

原始SQL

多数数据匹配issuc='Y' 条件，执行计划就是两个大表，进行hashjoin。
select * from tp01 where id in (select id from tp01_state where issuc = 'Y');
--或者
select * from tp01 where exists (select 1 from tp01_state where id = tp01.id and issuc = 'Y');

执行计划：

Hash Semi Join  (cost=338555.00..1033383.15 rows=10000000 width=241) (actual time=2398.867..5889.537 rows=9999900 loops=1)
  Hash Cond: (tp01.id = tp01_state.id)
  ->  Seq Scan on tp01  (cost=0.00..444828.12 rows=10000012 width=241) (actual time=0.005..611.596 rows=10000000 loops=1)
  ->  Hash  (cost=213555.00..213555.00 rows=10000000 width=4) (actual time=2384.857..2384.858 rows=9999900 loops=1)
        Buckets: 16777216  Batches: 1  Memory Usage: 482631kB
        ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=10000000 width=4) (actual time=0.011..775.853 rows=9999900 loops=1)
              Filter: (issuc = 'Y'::text)
              Rows Removed by Filter: 100
Planning Time: 0.186 ms
Execution Time: 6137.233 ms

问题解析：tp_state 数据千万级，从执行计划可以看出，在inner table创建哈希表这一环节，优化器就耗时1.6s (2384ms - 775ms)，而Hash Semi Join 这环节也耗时了近 2.9s

PS：这里关键是tp01_state作为hash table，其结果集过大，导致hash 与比较操作消耗大量时间。

2、优化1：使用NOT IN 代替 IN

因为只有少量数据，匹配issuc='Y'反向条件，使用not in 减少大表的过滤操作。

select * from tp01 where tp01.id not in (select id from tp01_state where issuc <> 'Y' or issuc is null );-- QUERY PLAN
Seq Scan on tp01  (cost=213555.00..683383.15 rows=5000006 width=241) (actual time=517.554..1629.795 rows=9999900 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 100
  SubPlan 1
    ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=1 width=4) (actual time=271.143..517.503 rows=100 loops=1)
          Filter: (issuc <> 'Y'::text)
          Rows Removed by Filter: 9999900
Planning Time: 0.087 ms
Execution Time: 1870.376 ms

修改后的SQL，虽然使用了filter 方式，但由于SubPlan 1 结果集很小，效率还是提高了不少。

3、优化2：使用not between 代替 <>

not between 操作可以使用索引，就可以减少子查询的执行时间。

select *
from tp01
where tp01.id not in (select id from tp01_state where issuc not between 'Y' and 'Y' or issuc is null);

-- QUERY PLAN
Seq Scan on tp01  (cost=17.35..469845.50 rows=5000006 width=241) (actual time=0.098..1109.085 rows=9999900 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 100
  SubPlan 1
    ->  Bitmap Heap Scan on tp01_state  (cost=13.33..17.34 rows=1 width=4) (actual time=0.035..0.045 rows=100 loops=1)
          Recheck Cond: ((issuc < 'Y'::text) OR (issuc > 'Y'::text) OR (issuc IS NULL))
          Heap Blocks: exact=2
          ->  BitmapOr  (cost=13.33..13.33 rows=1 width=0) (actual time=0.028..0.030 rows=0 loops=1)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.020..0.020 rows=100 loops=1)
                      Index Cond: (issuc < 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                      Index Cond: (issuc > 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                      Index Cond: (issuc IS NULL)
Planning Time: 0.109 ms
Execution Time: 1349.526 ms