PostGIS查询优化之特殊的空间关系ST_Disjoint

Spatial Data 2021-05-08

3911

一前言

Spatial SQL查询较慢主要原因如下：

建立空间索引的图形列通过表达式、函数、类型转换后导致索引失效。
部分空间函数根本不会走空间索引。
索引效率低，IO与CPU放大严重。

本文主要阐述第二点，如何处理根本不走空间索引的空间关系ST_Disjoint（相离）。

二 ST_Disjoint为何特殊

空间关系

PostGIS中的空间关系主要就分2大类，相交与相离，相交大类根据强topo规则又可以继续细分更多更精确的空间关系。

PostGIS中的gist索引实际可以看成是Rtree索引的升级版，它工作的机制是，首先根据查询条件图形的外接矩形去索引中命中数据，然后再recheck数据，将命中的数据与输入图形一一对比计算，判定实际的空间关系是否成立，如下图：

空间索引查询机制

上图可以看出，空间索引其实都是针对相交及其细分的空间关系的，相离实际是相交的对立面，正如=与!=、in与not in的对立，在btree索引中，对索引字段进行"="查询一般是走索引的，"!="查询一般是不走索引的，那么对于空间索引gist而言，“相交”是走索引的，反义的“相离”不走索引。

三案例说明

场景：已知居民点house表数量25000，交通点station表数量6000，统计所有交通点200米以外的居民点数量。

创建测试数据：

--居民点
create table house(
  gid serial primary key,
  geom geometry(Point,4326)
);
create index house_geom_idx on house using gist(geom);
--区域内构造测试数据
insert into house(geom)
select (st_dump(
ST_GeneratePoints(
  ST_MakeEnvelope(113,24,125,35,4326),
  25000)
)).geom;


--交通点
create table station(
  gid serial primary key,
  geom geometry(Point,4326)
);
insert into station(geom)
select (st_dump(
ST_GeneratePoints(
  ST_MakeEnvelope(113,24,125,35,4326),
  6000)
)).geom;

业务逻辑：

1 station先缓冲200米；

2 合并所有小缓冲区为一个面；

3 计算与缓冲面不相交的居民点；

业务逻辑

业务sql：

explain with temp_t as 
(select st_union(st_buffer(geom::geography,200)::geometry) as geom 
from station) 
select count(a.*) from house a,temp_t b where 
ST_Disjoint(a.geom,b.geom);
                                                                                                                QUERY PLAN                                                                 
                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------
 Aggregate  (cost=1077439.85..1077439.86 rows=1 width=8)
   ->  Nested Loop  (cost=451710.00..1077419.02 rows=8333 width=60)
         Join Filter: st_disjoint(a.geom, (st_union((geography(st_transform(st_buffer(st_transform(geometry((station.geom)::geography), _st_bestsrid((station.geom)::geography)), '200'::do
uble precision, ''::text), 4326)))::geometry)))
         ->  Aggregate  (cost=451710.00..451710.01 rows=1 width=32)
               ->  Seq Scan on station  (cost=0.00..110.00 rows=6000 width=32)
         ->  Seq Scan on house a  (cost=0.00..459.00 rows=25000 width=92)
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true

很明显，顺序扫描，不会走空间索引，而且如果你不信邪的话，这么点数据查询会卡死很久很久。。。

四优化方案

既然ST_Disjoint函数根本不走空间索引，那么换个思路，假设先查询交通点200米范围内的居民点，然后从总的居民点记录扣除查询到的数据，剩下的数据就是满足业务要求的结果了。

初步重构方案：简单的相交取反，从卡死优化到29s完成。

explain analyse select count(a.*) from house a where gid not in (
select distinct(a.gid) from house a,
(select st_union(st_buffer(geom::geography,200)::geometry) geom from station) b 
where ST_intersects(a.geom,b.geom) );
                    QUERY PLAN                                                            
-------------------------------------------------------
 (actual time=29084.264..29084.268 rows=1 loops=1)
   ->  Seq Scan on house a  (cost=451773.12..452294.62 rows=12500 width=60) (actual time=29076.760..29083.094 rows=24992 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 8
         SubPlan 1
           ->  Unique  (cost=451772.93..451773.06 rows=25 width=4) (actual time=29062.593..29062.600 rows=8 loops=1)
                 ->  Sort  (cost=451772.93..451773.00 rows=25 width=4) (actual time=29062.591..29062.595 rows=8 loops=1)
                       Sort Key: a_1.gid
                       Sort Method: quicksort  Memory: 25kB
                       ->  Nested Loop  (cost=451710.28..451772.35 rows=25 width=4) (actual time=5319.598..29062.563 rows=8 loops=1)
                             ->  Aggregate  (cost=451710.00..451710.01 rows=1 width=32) (actual time=1952.370..1952.371 rows=1 loops=1)
                                   ->  Seq Scan on station  (cost=0.00..110.00 rows=6000 width=32) (actual time=0.011..1.205 rows=6000 loops=1)
                             ->  Index Scan using house_geom_idx on house a_1  (cost=0.28..62.31 rows=2 width=36) (actual time=3367.208..27110.156 rows=8 loops=1)
                                   Index Cond: (geom && (st_union((geography(st_transform(st_buffer(st_transform(geometry((station.geom)::geography), _st_bestsrid((station.geom)::geograph
y)), '200'::double precision, ''::text), 4326)))::geometry)))
                                   Filter: st_intersects(geom, (st_union((geography(st_transform(st_buffer(st_transform(geometry((station.geom)::geography), _st_bestsrid((station.geom)::g
eography)), '200'::double precision, ''::text), 4326)))::geometry)))
                                   Rows Removed by Filter: 24990

结论：

1 ST_Intersect函数是走Index Scan using house_geom_idx空间索引的。

2 house数据总共25000，索引命中后，再Filter：Rows Removed by Filter: 24990，可以看到绝大部分索引命中的数据都被removed了，证明空间索引效果很差，严重的IO与CPU放大，问题应该出在ST_Union上，缓冲区合并后的图形的bbox如下图蓝色区域，可以看出这个bbox放大很严重：

3 查询从卡死到29s返回结果，有了很大的提升，但是查询时间仍然很久很难令人接受。

二次重构方案：查询的时候不使用ST_Union，从29s优化到858ms。

不合并缓冲区后，每个交通点缓冲区的bbox都是独立的，并且范围不大，那么索引命中效率就会非常高，有效抑制IO与CPU放大问题，如下图：

explain analyse select count(a.*) from house a where gid not in (
select distinct(a.gid) from house a,
(select st_buffer(geom::geography,200)::geometry geom from station) b 
where ST_intersects(a.geom,b.geom) );
                     QUERY PLAN                                                            
-------------------------------------------------------
(actual time=858.170..858.173 rows=1 loops=1)
   ->  Seq Scan on house a  (cost=1671767.45..1672288.95 rows=12500 width=60) (actual time=849.879..857.160 rows=24992 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 8
         SubPlan 1
           ->  Unique  (cost=1670954.95..1671704.95 rows=25000 width=4) (actual time=501.950..501.956 rows=8 loops=1)
                 ->  Sort  (cost=1670954.95..1671329.95 rows=150000 width=4) (actual time=501.949..501.952 rows=8 loops=1)
                       Sort Key: a_1.gid
                       Sort Method: quicksort  Memory: 25kB
                       ->  Nested Loop  (cost=75.41..1656008.00 rows=150000 width=4) (actual time=102.677..501.871 rows=8 loops=1)
                             ->  Seq Scan on station  (cost=0.00..110.00 rows=6000 width=32) (actual time=0.015..1.282 rows=6000 loops=1)
                             ->  Index Scan using house_geom_idx on house a_1  (cost=75.41..275.96 rows=2 width=36) (actual time=0.019..0.019 rows=0 loops=6000)
                                   Index Cond: (geom && (geography(st_transform(st_buffer(st_transform(geometry((station.geom)::geography), _st_bestsrid((station.geom)::geography)), '200'
::double precision, ''::text), 4326)))::geometry)
                                   Filter: st_intersects(geom, (geography(st_transform(st_buffer(st_transform(geometry((station.geom)::geography), _st_bestsrid((station.geom)::geography))
, '200'::double precision, ''::text), 4326)))::geometry)
                                   Rows Removed by Filter: 0

结论：

1 ST_Intersects函数是走了house_geom_idx索引的，并且Filter后removed记录数量为0，证明索引效率非常的高，完全没有多余数据被索引命中。

2 not in 查询可以使用not exists或左连接优化。

三次重构方案：从858ms到645ms。

not exists优化

xplain analyse SELECT count(*) from house a 
WHERE NOT EXISTS (
    SELECT * 
    FROM (select distinct(a.gid) gid from house a,
(select st_buffer(geom::geography,200)::geometry geom from station) b 
where ST_intersects(a.geom,b.geom) ) b
    WHERE b.gid= a.gid
);
               QUERY PLAN                                                                                                       
---------------------------------------------
(actual time=644.843..644.848 rows=1 loops=1)

左连接优化：

explain analyse select count(*) from (
select a.*,b.gid as gid2 from house a left join (
select distinct(a.gid) from house a,
(select st_buffer(geom::geography,200)::geometry geom from station) b 
where ST_intersects(a.geom,b.geom) ) as b on a.gid=b.gid) as c 
where gid2 is null;
                QUERY PLAN                                                                                                          
------------------------------------------------
(actual time=647.435..647.439 rows=1 loops=1)

结论：左连接和not exists是比not in性能更好。

五总结

ST_Disjoint空间关系是不会走空间索引的，该函数绝对不要在空间查询的条件中使用。
ST_Disjoint查询场景应该使用ST_Intersects查询取反去获取结果。
不要在查询过程中，查询条件中使用ST_Union函数，只能在最终的查询结果输出时可以使用合并函数，因为在过程中、条件中使用，一定会导致IO与CPU放大，极其影响性能。
精益求精，对任何可能存在优化空间的sql语句，都应该尝试测试下输出性能。

数据库

文章转载自Spatial Data，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostGIS查询优化之特殊的空间关系ST_Disjoint

评论