SQL优化 -- 排序列的顺序影响数据排序性能

原创金仓数据库 2023-07-25

3883

在多列排序时，排序列的顺序对于排序操作的性能影响很大。通常，将NDV值大的放在前面性能更佳。以下通过例子展示。

一、构造测试数据

create table t1(id1 text, id2 text);
insert into t1 select mod(generate_series(10000000,15000000),10000)||'0000',generate_series(10000000,15000000);
insert into t1 select (mod(generate_series(10000000,15000000),10000)+1)||'0000',generate_series(10000000,15000000);

test=# select count(distinct id1),count(distinct id2),count(*)  from t1;
 count |  count  |  count
-------+---------+----------
 10001 | 5000001 | 10000002

二、distinct 操作

1、当内存足够大时

可以看到，当内存足够大时，优化器采用hashaggr 算法，而非排序，所以性能上没有差异。

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ distinct id1 ,id2 from t1 ;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=214003.11..266948.84 rows=5294573 width=17) (actual time=6679.322..10131.612 rows=10000002 loops=1)
   Group Key: id1, id2
   Buffers: shared hit=64 read=63937
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.092..1071.132 rows=10000002 loops=1)
         Buffers: shared hit=64 read=63937
 Planning Time: 0.074 ms
 Execution Time: 10530.949 ms
(7 行记录)

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ distinct id2,id1 from t1 ;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=214003.11..266948.84 rows=5294573 width=17) (actual time=6638.296..9819.373 rows=10000002 loops=1)
   Group Key: id2, id1
   Buffers: shared hit=96 read=63905
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.088..1068.998 rows=10000002 loops=1)
         Buffers: shared hit=96 read=63905
 Planning Time: 0.082 ms
 Execution Time: 10199.573 ms
(7 行记录)

2、当内存较小时

可以看到id2 （distinct 值更多）在前，性能更佳

test=# explain (analyse ,buffers,verbose ) select /*+ set(work_mem 100MB) */ distinct id2 ,id1 from t1 ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1531777.15..1606778.21 rows=5294573 width=17) (actual time=8393.168..14878.930 rows=10000002 loops=1)
   Output: id2, id1
   Buffers: shared hit=128 read=63873, temp read=102913 written=103170
   ->  Sort  (cost=1531777.15..1556777.51 rows=10000141 width=17) (actual time=8393.166..12881.589 rows=10000002 loops=1)
         Output: id2, id1
         Sort Key: t1.id2, t1.id1
         Sort Method: external merge  Disk: 272936kB
         Buffers: shared hit=128 read=63873, temp read=102913 written=103170
         ->  Seq Scan on public.t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.047..1154.290 rows=10000002 loops=1)
               Output: id2, id1
               Buffers: shared hit=128 read=63873
 Planning Time: 0.111 ms
 Execution Time: 15264.548 ms
(13 行记录)

test=# explain (analyse ,buffers,verbose ) select /*+ set(work_mem 100MB) */ distinct id1,id2 from t1 ;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1531777.15..1606778.21 rows=5294573 width=17) (actual time=23940.932..31830.909 rows=10000002 loops=1)
   Output: id1, id2
   Buffers: shared hit=160 read=63841, temp read=102913 written=103170
   ->  Sort  (cost=1531777.15..1556777.51 rows=10000141 width=17) (actual time=23940.931..30050.674 rows=10000002 loops=1)
         Output: id1, id2
         Sort Key: t1.id1, t1.id2
         Sort Method: external merge  Disk: 272936kB
         Buffers: shared hit=160 read=63841, temp read=102913 written=103170
         ->  Seq Scan on public.t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.027..791.278 rows=10000002 loops=1)
               Output: id1, id2
               Buffers: shared hit=160 read=63841
 Planning Time: 0.063 ms
 Execution Time: 32206.516 ms
(13 行记录)

三、group by 操作

1、当内存足够大时

同样，因为内存较大，group by 操作通过hashaggr，并未排序，所以性能无差异

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ id1 ,id2 from t1 group by id1 ,id2 ;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=214003.11..266948.84 rows=5294573 width=17) (actual time=6072.929..9338.734 rows=10000002 loops=1)
   Group Key: id1, id2
   Buffers: shared hit=192 read=63809
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.076..963.138 rows=10000002 loops=1)
         Buffers: shared hit=192 read=63809
 Planning Time: 0.137 ms
 Execution Time: 9706.713 ms
(7 行记录)

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ id1 ,id2 from t1 group by id2,id1 ;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=214003.11..266948.84 rows=5294573 width=17) (actual time=5839.798..8975.304 rows=10000002 loops=1)
   Group Key: id2, id1
   Buffers: shared hit=224 read=63777
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.067..913.143 rows=10000002 loops=1)
         Buffers: shared hit=224 read=63777
 Planning Time: 0.081 ms
 Execution Time: 9339.602 ms
(7 行记录)

2、当内存较小时

同样，可以看到 id2 （distinct 值更多）在前，性能更佳

test=# explain (analyse ,buffers ) select /*+ set(work_mem 100MB) */ id1 ,id2 from t1 group by id1 ,id2 ;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=1531777.15..1606778.21 rows=5294573 width=17) (actual time=23888.203..32225.259 rows=10000002 loops=1)
   Group Key: id1, id2
   Buffers: shared hit=256 read=63745, temp read=102913 written=103170
   ->  Sort  (cost=1531777.15..1556777.51 rows=10000141 width=17) (actual time=23888.197..30219.007 rows=10000002 loops=1)
         Sort Key: id1, id2
         Sort Method: external merge  Disk: 272936kB
         Buffers: shared hit=256 read=63745, temp read=102913 written=103170
         ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.064..841.771 rows=10000002 loops=1)
               Buffers: shared hit=256 read=63745
 Planning Time: 0.114 ms
 Execution Time: 32600.554 ms
(11 行记录)

test=# explain (analyse ,buffers ) select /*+ set(work_mem 100MB) */ id1 ,id2 from t1 group by id2,id1 ;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=1531777.15..1606778.21 rows=5294573 width=17) (actual time=7911.096..14479.694 rows=10000002 loops=1)
   Group Key: id2, id1
   Buffers: shared hit=288 read=63713, temp read=102913 written=103170
   ->  Sort  (cost=1531777.15..1556777.51 rows=10000141 width=17) (actual time=7911.092..12323.432 rows=10000002 loops=1)
         Sort Key: id2, id1
         Sort Method: external merge  Disk: 272936kB
         Buffers: shared hit=288 read=63713, temp read=102913 written=103170
         ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.039..803.117 rows=10000002 loops=1)
               Buffers: shared hit=288 read=63713
 Planning Time: 0.081 ms
 Execution Time: 14851.643 ms
(11 行记录)

四、order by 操作

由于order by 操作不能用hashaggr ，因此，内存大小只会影响排序的性能。可以看到，在同样内存的情况下，id2 在前性能更佳。

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ id1 ,id2 from t1 order by id2,id1 ;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1326694.65..1351695.01 rows=10000141 width=17) (actual time=7860.700..11978.304 rows=10000002 loops=1)
   Sort Key: id2, id1
   Sort Method: external merge  Disk: 272936kB
   Buffers: shared hit=320 read=63681, temp read=102913 written=103170
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.057..776.822 rows=10000002 loops=1)
         Buffers: shared hit=320 read=63681
 Planning Time: 0.058 ms
 Execution Time: 12341.831 ms
(8 行记录)

test=# explain (analyse ,buffers ) select /*+ set(work_mem 1024MB) */ id1 ,id2 from t1 order by id1,id2 ;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1326694.65..1351695.01 rows=10000141 width=17) (actual time=23292.115..29392.412 rows=10000002 loops=1)
   Sort Key: id1, id2
   Sort Method: external merge  Disk: 272936kB
   Buffers: shared hit=352 read=63649, temp read=102913 written=103170
   ->  Seq Scan on t1  (cost=0.00..164002.41 rows=10000141 width=17) (actual time=0.071..829.915 rows=10000002 loops=1)
         Buffers: shared hit=352 read=63649
 Planning Time: 0.103 ms
 Execution Time: 29774.684 ms
(8 行记录)