PostgreSQL简单验证，似是而非union all后order by取limit

开源软件联盟PostgreSQL分会 2020-08-11

1307

文章之前的描述存在不少问题，今天重新改写了些内容。

开篇之前还是先来回顾一下这个错误的MySQL优化案例：

开发同学期望将两个表的数据汇集之后，取满足条件的top N，于是使用union all联合两个结果集，最后使用limit取top 30。两个表的数据量接近500w，执行耗时12.66秒：

整体来看，性能比较差，下面这种优化是期望将limit推入到两个联合的子句中来减少联合对性能影响，执行耗时在0.03秒，效果还不错。

但这里用到的优化思路看起来好像没有问题，但逻辑的实现是有问题的，而且改写前原SQL的逻辑也是存在问题的，具体是什么问题，看完下面的实验之后，相信也就自然明白了。

下面我们通过PostgreSQL来演示一下这里埋的是什么坑。MySQL和Oracle应该也是一样的效果。时间有限，就不做过多对比了。

先来创建两个测试表：

akendb=# \d+ tab01 
                                    Table "public.tab01"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 id        | integer |           | not null | 0       | plain   |              |
 num_col01 | integer |           | not null | 0       | plain   |              | 
akendb=# \d+ tab02
                                    Table "public.tab02"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 id        | integer |           | not null | 0       | plain   |              | 
 num_col01 | integer |           | not null | 0       | plain   |              |


akendb=#

然后每张表简单插入3行数据即可：

akendb=# select * from tab01;
id | num_col01 
----+-----------
1 |       100
3 |        45
3 |        11
(3 rows)
akendb=# select * from tab02;
id | num_col01 
----+-----------
1 |        73
2 |        58
2 |        22
(3 rows)
akendb=#

--两表数据union all：

(select * from tab01 )

union all

(select * from tab02)

akendb=# (select * from tab01 ) 
akendb-# union all 
akendb-# (select * from tab02) ;
id | num_col01 
----+-----------
1 |       100
3 |        45
3 |        11
1 |        73
2 |        58
2 |        22
(6 rows)
akendb=#

--测试语句：业务期望是想得出两表合并后根据id排序前面的两条记录，即tab01中(id=1,num_col01=100)和tab02中的(id=1,num_col01=73)。

(select * from tab01 )

union all

(select * from tab02)

order by id asc limit 2;

akendb=# (select * from tab01 ) 
akendb-# union all 
akendb-# (select * from tab02) 
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |       100
  1 |        73
(2 rows)
akendb=#

接着我们对调一下量表的先后顺序，除了记录的顺序不一样，得到的结果是相同的两行数据。

(select * from tab02 )

union all

(select * from tab01)

order by id asc limit 2;

akendb=# (select * from tab02 ) 人更美akendb-# union all akendb-# (select * from tab01) akendb-# order by  id asc  limit  2; id | num_col01 ----+-----------  1 |        73  1 |       100(2 rows)akendb=#

依样画葫芦，我们将order by和limit内推到union all各个字句中，查询结果和上面的一致，好像没什么问题。

(select * from tab01 order by id limit 2)

union all

(select * from tab02 order by id limit 2)

order by id asc limit 2;

(select * from tab02 order by id limit 2)

union all

(select * from tab01 order by id limit 2)

order by id asc limit 2;

akendb=# (select * from tab01 order by id limit 2) 
akendb-# union all 
akendb-# (select * from tab02 order by id limit 2)   
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |       100
  1 |        73
(2 rows)
akendb=# (select * from tab02 order by id limit 2) 
akendb-# union all 
akendb-# (select * from tab01 order by id limit 2)   
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |        73
  1 |       100
(2 rows)
akendb=#

细心的同学或许已经看出了其中的问题，为了让问题更加明显，我这里将order by字段的值改动一下：让每个表或第一张表的order by字段重复值的数量大于limit n中的n：

update tab01 set id=1;

update tab02 set id=2;

akendb=# update tab01 set id=1;   --进行update之后id=1的行数为3，大于limit 2取数范围。
UPDATE 3
akendb=# update tab02 set id=1;
UPDATE 3
akendb=#

首先，直接执行原来的查询语句：

(select * from tab01 )

union all

(select * from tab02)

order by id asc limit 2;

(select * from tab02 )

union all

(select * from tab01)

order by id asc limit 2;

akendb=# (select * from tab01 ) 
akendb-# union all 
akendb-# (select * from tab02) 
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |        45
  1 |       100
(2 rows)
akendb=# (select * from tab02 ) 
akendb-# union all 
akendb-# (select * from tab01) 
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |        58
  1 |        73
(2 rows)
akendb=#

上面执行结果和之前的查询结果不太一样了，原因是union all中做order by实际上根据第一个表的字段在排序，然后limit的取值实际上倾向于能取第一个结果集的limit n，所以到底是tab01还是tab02在前，其实在union all的应用场景里面是有所区别的。

akendb=#  explain analyze (select * from tab01 ) union all (select * from tab02) order by  id asc  limit  2;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=110.40..110.41 rows=2 width=8) (actual time=0.031..0.033 rows=2 loops=1)
   ->  Sort  (cost=110.40..121.70 rows=4520 width=8) (actual time=0.030..0.030 rows=2 loops=1)
         Sort Key: tab01.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Append  (cost=0.00..65.20 rows=4520 width=8) (actual time=0.012..0.018 rows=6 loops=1)
               ->  Seq Scan on tab01  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.012..0.013 rows=3 loops=1)
               ->  Seq Scan on tab02  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.004..0.004 rows=3 loops=1)
 Planning time: 0.118 ms
 Execution time: 0.062 ms
(9 rows)


akendb=#  explain analyze (select * from tab02 ) union all (select * from tab01) order by  id asc  limit  2;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=110.40..110.41 rows=2 width=8) (actual time=0.029..0.031 rows=2 loops=1)
   ->  Sort  (cost=110.40..121.70 rows=4520 width=8) (actual time=0.028..0.028 rows=2 loops=1)
         Sort Key: tab02.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Append  (cost=0.00..65.20 rows=4520 width=8) (actual time=0.010..0.016 rows=6 loops=1)
               ->  Seq Scan on tab02  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.010..0.011 rows=3 loops=1)
               ->  Seq Scan on tab01  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.003..0.004 rows=3 loops=1)
 Planning time: 0.110 ms
 Execution time: 0.058 ms
(9 rows)
akendb=#

接着，我们先看看“优化”改写后的效果：

(select * from tab01 order by id limit 2)

union all

(select * from tab02 order by id limit 2)

order by id asc limit 2;

(selt * from tab02 order by id limit 2)

union all

(select * from tab01 order by id limit 2)

order by id asc limit 2;

akendb=# (select * from tab01 order by id limit 2) 
akendb-# union all 
akendb-# (select * from tab02 order by id limit 2)   
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |       100
  1 |        45
(2 rows) 
akendb=# (select * from tab02 order by id limit 2) 
akendb-# union all 
akendb-# (select * from tab01 order by id limit 2)   
akendb-# order by  id asc  limit  2;
 id | num_col01 
----+-----------
  1 |        73
  1 |        58
(2 rows)
akendb=#

上面的输出结果，看似和原句的结果保持一致，事实上，当第一张表的中order by字段的记录满足比第二张表的记录都小的时候，那么这种内推到子句的改法是明显错误的，比如下面这种情况：

akendb=# select * from tab01;
 id | num_col01 
----+-----------
  1 |        45
  1 |         6
  1 |        65
(3 rows)

akendb=# select * from tab02;
 id | num_col01 
----+-----------
  2 |        13
  2 |        16
  2 |        82
(3 rows)

akendb=# (select * from tab01 order by id limit 2) 
union all 
(select * from tab02 order by id limit 2)   
order by  id asc  limit  3;
 id | num_col01 
----+-----------
  1 |        45
  1 |         6
  2 |        13
(3 rows)

akendb=#

这里的查询结果就不应该出现第二张表的记录，但因为内推改变了取数据的逻辑，这是明显搞错了。

所以，在union all之后执行order by取top n看似逻辑的理解上正确了，但在实际查询结果却偏差了，不少开发和运维的同学可能并没有觉察，以前遇到过，今儿再次遇到了，简单记录一下。


I Love PG

关于我们
中国开源软件推进联盟PostgreSQL分会（简称：中国PG分会）于2017年成立，由国内多家PostgreSQL生态企业所共同发起，业务上接受工信部中国电子信息产业发展研究院指导。中国PG分会是一个非盈利行业协会组织。我们致力于在中国构建PostgreSQL产业生态，推动PostgreSQL产学研用发展。

欢迎投稿

做你的舞台，show出自己的才华。

投稿邮箱：partner@postgresqlchina.com

——愿能安放你不羁的灵魂

技术文章精彩回顾




PostgreSQL学习的九层宝塔
PostgreSQL职业发展与学习攻略
2019，年度数据库舍 PostgreSQL 其谁?
Postgres是最好的开源软件
PostgreSQL是世界上最好的数据库
从Oracle迁移到PostgreSQL的十大理由
从“非主流”到“潮流”，开源早已值得拥有

PG活动精彩回顾




创建PG全球生态！PostgresConf.CN2019大会盛大召开
首站起航！2019“让PG‘象’前行”上海站成功举行
走进蓉城丨2019“让PG‘象’前行”成都站成功举行
中国PG象牙塔计划发布，首批合作高校授牌仪式在天津举行
群英论道聚北京，共话PostgreSQL
相聚巴厘岛| PG Conf.Asia 2019  DAY0、DAY1简报
相知巴厘岛| PG Conf.Asia 2019 DAY2简报
独家|硅谷Postgres大会简报
直播回顾 | Bruce Momjian：原生分布式将在PG 14版本发布

PG培训认证精彩回顾




中国首批PGCA认证考试圆满结束，203位考生成功获得认证！
中国第二批PGCA认证考试圆满结束，115位考生喜获认证！
重要通知：三方共建，中国PostgreSQL认证权威升级！
近500人参与！首次PGCE中级、第三批次PGCA初级认证考试落幕！
2020年首批 | 中国PostgreSQL初级认证考试圆满结束
一分耕耘一分收获，第五批次PostgreSQL认证考试成绩公布

postgresql

文章转载自开源软件联盟PostgreSQL分会，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL简单验证，似是而非union all后order by取limit

评论