暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

沉浸式学习PostgreSQL|PolarDB 14: 共享单车、徒步、旅游、网约车轨迹查询

Aug_33 2023-11-15
332

背景

欢迎数据库应用开发者参与贡献场景, 在此issue回复即可, 共同建设《沉浸式数据库学习教学素材库》, 帮助开发者用好数据库, 提升开发者职业竞争力, 同时为企业降本提效.

  • 系列课程的核心目标是教大家怎么用好数据库, 而不是怎么运维管理数据库、怎么开发数据库内核. 所以面向的对象是数据库的用户、应用开发者、应用架构师、数据库厂商的产品经理、售前售后专家等角色.

本文的实验可以使用永久免费的阿里云云起实验室来完成.

如果你本地有docker环境也可以把镜像拉到本地来做实验:

x86_64机器使用以下docker image:

ARM机器使用以下docker image:

业务场景1 介绍: 共享单车、徒步、旅游、网约车轨迹查询

租用共享单车, 从开始租用到还车, 这段用户骑行的过程就是一段轨迹, 例如每5秒记录一个点(x,y,z).

很多小程序或健康APP通过手机或手环的定位来记录轨迹.

专业的驴友, 通常都有非常专业的带GPS的手表也会记录徒步的轨迹.

网约车就更不用说了, 行业要求必须记录每一单的行驶信息, 其中也包含了轨迹数据.

当用户量大了之后, 查询轨迹、分析轨迹可能是一个比较高频的操作.

这个实验的目的是帮助你了解如何设计轨迹表, 如何高性能的写入、查询、分析轨迹数据.

实现和对照

轨迹的生成的特点:

  • 1、平台上使用软件的人很多
  • 2、可能很多人在同时产生运动轨迹中的点(例如高峰期很多人同时在使用共享单车、网约车. 同一个时刻可能很多人在打点).

基于这两个特点, 如果每生成1条轨迹就往数据库中写入一条记录, 会导致一个轨迹若干个点的物理存储是非常离散的.

即使一次写入一个轨迹的多条记录, 也可能出现一个轨迹若干个点的物理存储是非常离散的情况.

传统方法 设计和实验

轨迹中的每个点存储一条记录, 最终拼成一条轨迹.

设计2个表:

1、存储每个轨迹的唯一轨迹 ID

drop table if exists tbl_path_id;  
  
create table tbl_path_id (  
  id int primary key,   -- 轨迹ID  
  uid int,  -- 这段轨迹对应的用户  
  ts_begin timestamp,   -- 轨迹开始时间  
  ts_end timestamp,   -- 轨迹结束时间  
  pos_begin point,  -- 为了简化例子, 使用PG|PolarDB内置的平面点. 真实场景通常是用PostGIS插件的geometry类型, 包括经纬度和海平面高度.   
  pos_end point  -- 轨迹结束位置  
);

2、存储每个对象的一段运动轨迹(例如单车骑行的一单轨迹、徒步旅游开启的轨迹记录、...)

drop table if exists tbl_path_detail;  
  
create unlogged table tbl_path_detail (  -- 方便测试, 使用unlogged table.  
  pid int,  -- 轨迹ID  
  pos point,  -- 位置  
  ts timestamp  -- 时间  
);

3、创建轨迹索引, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.

create index on tbl_path_detail (pid,ts);

4、为了方便生成测试数据, 先模拟写入1000条轨迹, 平均每条轨迹1万个点, 形成1000万条轨迹点, 再统计到轨迹唯一标识表.

vi t1.sql  
  
\set pid random(1,1000)  
insert into tbl_path_detail values (:pid, point(random(), random()), now());
pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 10 -j 10 -t 1000000  
  
transaction type: ./t1.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.053 ms  
latency stddev = 0.102 ms  
initial connection time = 23.060 ms  
tps = 189004.207706 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set pid random(1,1000)  
         0.052  insert into tbl_path_detail values (:pid, point(random(), random()), now());
postgres=# select * from tbl_path_detail limit 10;  
 pid |                   pos                    |             ts               
-----+------------------------------------------+----------------------------  
 160 | (0.5000283130032592,0.15678314824874917) | 2023-09-08 02:05:15.442562  
  43 | (0.7333981781385361,0.6660386524481545)  | 2023-09-08 02:05:15.442563  
 602 | (0.5665795071743318,0.7419887321824241)  | 2023-09-08 02:05:15.44323  
 738 | (0.4883731035056087,0.5410191566444489)  | 2023-09-08 02:05:15.444977  
 925 | (0.47646978937058293,0.3324764895717003) | 2023-09-08 02:05:15.445011  
 934 | (0.8969771676532297,0.11898616861406452) | 2023-09-08 02:05:15.445049  
 748 | (0.0461507467234199,0.3288353473172556)  | 2023-09-08 02:05:15.445128  
 330 | (0.05559804559846171,0.7381967670381471) | 2023-09-08 02:05:15.445196  
 978 | (0.21187515313173577,0.6163791267053291) | 2023-09-08 02:05:15.445191  
 626 | (0.7069111074590744,0.4406223155616047)  | 2023-09-08 02:05:15.445312  
(10 rows)

5、将轨迹点汇总到轨迹唯一标识表

insert into tbl_path_id (id,uid,ts_begin,ts_end,pos_begin,pos_end)   
select pid, random()*100, ts_begin, ts_end, pos_begin, pos_end from   
  (select pid,   
    row_number() over w as rn,   
    first_value(ts) over w as ts_begin,   
    last_value(ts) over w as ts_end,   
    first_value(pos) over w as pos_begin,   
    last_value(pos) over w as pos_end   
  from tbl_path_detail   
    window w as (partition by pid order by ts RANGE BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) ) t   
where rn=1;
postgres=# select * from tbl_path_id limit 10;  
 id | uid |          ts_begin          |           ts_end           |                 pos_begin                 |                  pos_end                    
----+-----+----------------------------+----------------------------+-------------------------------------------+-------------------------------------------  
  1 |   2 | 2023-09-08 02:05:15.458977 | 2023-09-08 02:06:08.348456 | (0.9431799157631602,0.41223770583725994)  | (0.43968362954398543,0.8294152062281377)  
  2 |  15 | 2023-09-08 02:05:15.455308 | 2023-09-08 02:06:08.348047 | (0.20157767943472393,0.7583908472256411)  | (0.8633428320677936,0.020082269279203757)  
  3 |  60 | 2023-09-08 02:05:15.450998 | 2023-09-08 02:06:08.273702 | (0.8676802874185334,0.9332642681167798)   | (0.7358999901381367,0.8543637573444975)  
  4 |  28 | 2023-09-08 02:05:15.455818 | 2023-09-08 02:06:08.329148 | (0.5804204502817427,0.31601818657295055)  | (0.0174753704646875,0.18107497337433998)  
  5 |  97 | 2023-09-08 02:05:15.458717 | 2023-09-08 02:06:08.3295   | (0.24820536688948636,0.802131631221048)   | (0.8229560445619342,0.5435402893223191)  
  6 |  33 | 2023-09-08 02:05:15.450005 | 2023-09-08 02:06:08.312634 | (0.527644965371401,0.12866950317216563)   | (0.697993640371152,0.6356491117241596)  
  7 |   3 | 2023-09-08 02:05:15.462201 | 2023-09-08 02:06:08.345394 | (0.23643719560948284,0.6980441144259366)  | (0.7981387104508499,0.4564650368837313)  
  8 |  38 | 2023-09-08 02:05:15.457057 | 2023-09-08 02:06:08.275626 | (0.17698975732867694,0.33573573990590333) | (0.6194784648490028,0.8182679927624612)  
  9 |  18 | 2023-09-08 02:05:15.454722 | 2023-09-08 02:06:08.350231 | (0.4388676699772063,0.7529218330600749)   | (0.2080260827608882,0.04568571675149613)  
 10 |  40 | 2023-09-08 02:05:15.452294 | 2023-09-08 02:06:08.326824 | (0.6334737638364523,0.07649313466173169)  | (0.8915125815199474,0.8221032349776571)  
(10 rows)

6、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.

vi t2.sql  
  
\set pid random(1,1000)  
select * from tbl_path_detail where pid=:pid order by ts;
pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 10 -j 10 -T 120  
  
  
transaction type: ./t2.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 15979  
latency average = 75.097 ms  
latency stddev = 9.750 ms  
initial connection time = 20.363 ms  
tps = 133.126750 (without initial connection time)  
statement latencies in milliseconds:  
         0.001  \set pid random(1,1000)  
        75.097  select * from tbl_path_detail where pid=:pid order by ts;

PolarDB|PG新方法1 设计和实验

在传统方法中, 平均一个轨迹有1万个点, 由于不同的轨迹点是并行写入的, 所以这1万个点非常可能分散在1万个数据块中, 导致查询1条轨迹要访问很多的数据块. 使得性能瓶颈在IO层面(如下这条轨迹消耗9538个IO).

postgres=# explain (analyze,verbose,timing,buffers,costs,settings,wal) select * from tbl_path_detail where pid=1 order by ts;  
                                                                          QUERY PLAN                                                                             
---------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using tbl_path_detail_pid_ts_idx on public.tbl_path_detail  (cost=0.43..10473.93 rows=9955 width=28) (actual time=0.060..24.022 rows=9937 loops=1)  
   Output: pid, pos, ts  
   Index Cond: (tbl_path_detail.pid = 1)  
   Buffers: shared hit=9538  
 Settings: enable_seqscan = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1'  
 Planning Time: 0.183 ms  
 Execution Time: 25.039 ms  
(7 rows)

PG|PolarDB支持include index, 为了解决这个问题, 可以将要查询的字段放入索引的叶子结点内, 避免回表导致的IO.

1、创建include index, 把pos放入索引叶子结点:

create index on tbl_path_detail (pid,ts) INCLUDE (pos);  
  
vacuum analyze tbl_path_detail;

2、此时使用新的index查询, 不需要回表, IO大幅度降低, 如下这条轨迹消耗66个IO:

postgres=# explain (analyze,verbose,timing,buffers,costs,settings,wal) select * from tbl_path_detail where pid=1 order by ts;  
                                                                             QUERY PLAN                                                                                
---------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using tbl_path_detail_pid_ts_pos_idx on public.tbl_path_detail  (cost=0.56..241.93 rows=9958 width=28) (actual time=0.082..4.217 rows=9937 loops=1)  
   Output: pid, pos, ts  
   Index Cond: (tbl_path_detail.pid = 1)  
   Heap Fetches: 0  
   Buffers: shared hit=66  
 Settings: enable_seqscan = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1'  
 Planning Time: 0.536 ms  
 Execution Time: 5.598 ms  
(8 rows)

3、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.

vi t2.sql  
  
\set pid random(1,1000)  
select * from tbl_path_detail where pid=:pid order by ts;
pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 10 -j 10 -T 120  
  
transaction type: ./t2.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 113515  
latency average = 10.570 ms  
latency stddev = 2.942 ms  
initial connection time = 18.558 ms  
tps = 945.998467 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set pid random(1,1000)  
        10.570  select * from tbl_path_detail where pid=:pid order by ts;

PolarDB|PG新方法2 设计和实验

虽然使用了include index, 但是索引里面就要放所有要查的字段内容, 导致索引占用空间会变大一点.

为了更好的优化这个能力, 可以使用聚集字段来存储一条轨迹的内容, 例如array.

1、创建一张轨迹表, 使用array存储轨迹中每个点的时间、位置.

drop table if exists tbl_path_detail1;  
  
create unlogged table tbl_path_detail1 (  -- 方便测试, 使用unlogged table.  
  pid int,  -- 轨迹ID  
  pos point[],  -- 位置数组  
  ts timestamp[]  -- 时间数组  
);  
  
create index on tbl_path_detail1 (pid);

2、将原始轨迹写入新的轨迹表.

insert into tbl_path_detail1 select pid, array_agg(pos order by ts), array_agg(ts order by ts) from tbl_path_detail group by pid;
postgres=# select pid, pos[1:10], ts[1:10] from tbl_path_detail1 where pid=1;  
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
pid | 1  
pos | {"(0.9431799157631602,0.41223770583725994)","(0.029285214711183727,0.13121257852994006)","(0.47231141497136164,0.40519912875600994)","(0.18841309050989707,0.9120792659923715)","(0.1669935448355666,0.788043626909726)","(0.7680464170226315,0.02176761079732259)","(0.1806378616460016,0.17626971610524578)","(0.6905476714298793,0.21249186817883725)","(0.20480152830909404,0.36856166073248176)","(0.37764392812544045,0.5214349561346836)"}  
ts  | {"2023-09-08 02:05:15.458977","2023-09-08 02:05:15.460392","2023-09-08 02:05:15.46647","2023-09-08 02:05:15.466755","2023-09-08 02:05:15.46993","2023-09-08 02:05:15.470127","2023-09-08 02:05:15.473513","2023-09-08 02:05:15.474446","2023-09-08 02:05:15.480445","2023-09-08 02:05:15.484117"}

应用层面, 未来在写入轨迹时也可以使用这个方法, 等轨迹完成一次性上传并写入新的轨迹表, 老的轨迹表就不需要了.

3、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.

vi t3.sql  
  
\set pid random(1,1000)  
select * from tbl_path_detail1 where pid=:pid;
pgbench -M prepared -n -r -P 1 -f ./t3.sql -c 10 -j 10 -T 120  
  
transaction type: ./t3.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 65173  
latency average = 18.410 ms  
latency stddev = 5.716 ms  
initial connection time = 23.213 ms  
tps = 543.135449 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set pid random(1,1000)  
        18.410  select * from tbl_path_detail1 where pid=:pid;

或者也可以使用text存储轨迹点.

4、创建一张轨迹表, 使用array存储轨迹中每个点的时间、位置.

drop table if exists tbl_path_detail2;  
  
create unlogged table tbl_path_detail2 (  -- 方便测试, 使用unlogged table.  
  pid int,  -- 轨迹ID  
  pos text,  -- 位置字符串  
  ts text  -- 时间字符串  
);  
  
create index on tbl_path_detail2 (pid);

5、将原始轨迹写入新的轨迹表.

insert into tbl_path_detail2 select pid, string_agg(pos::text, ',' order by ts), string_agg(ts::text, ',' order by ts) from tbl_path_detail group by pid;

应用层面, 未来在写入轨迹时也可以使用这个方法, 等轨迹完成一次性上传并写入新的轨迹表, 老的轨迹表就不需要了.

6、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.

vi t4.sql  
  
\set pid random(1,1000)  
select * from tbl_path_detail2 where pid=:pid;
pgbench -M prepared -n -r -P 1 -f ./t4.sql -c 10 -j 10 -T 120  
  
transaction type: ./t4.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 246670  
latency average = 4.864 ms  
latency stddev = 1.974 ms  
initial connection time = 21.508 ms  
tps = 2055.793315 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set pid random(1,1000)  
         4.864  select * from tbl_path_detail2 where pid=:pid;

对比轨迹表的空间和索引占用

postgres=# \dt+  
                                         List of relations  
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description   
--------+------------------+-------+----------+-------------+---------------+---------+-------------  
 public | tbl_path_detail  | table | postgres | unlogged    | heap          | 575 MB  |   
 public | tbl_path_detail1 | table | postgres | unlogged    | heap          | 202 MB  |   
 public | tbl_path_detail2 | table | postgres | unlogged    | heap          | 346 MB  |   
  
  
postgres=# \di+  
                                                          List of relations  
 Schema |              Name              | Type  |  Owner   |      Table       | Persistence | Access method |  Size   | Description   
--------+--------------------------------+-------+----------+------------------+-------------+---------------+---------+-------------  
 public | tbl_path_detail1_pid_idx       | index | postgres | tbl_path_detail1 | unlogged    | btree         | 48 kB   |   
 public | tbl_path_detail2_pid_idx       | index | postgres | tbl_path_detail2 | unlogged    | btree         | 48 kB   |   
 public | tbl_path_detail_pid_ts_idx     | index | postgres | tbl_path_detail  | unlogged    | btree         | 301 MB  |   
 public | tbl_path_detail_pid_ts_pos_idx | index | postgres | tbl_path_detail  | unlogged    | btree         | 473 MB  |

对照

聚集类型既能节省轨迹存储、索引空间, 又能提升轨迹查询的性能.

思考: 如果要按时间范围、空间范围搜索用户过去的轨迹, 然后再查询对应的轨迹点, 应该如何设计表结构? 如何使用索引?

1、空间对比

轨迹存储方式轨迹占用空间轨迹索引占用空间
传统存储(每个点一条记录,索引不包含位置)575 MB301 MB
传统存储(每个点一条记录,索引包含位置)575 MB473 MB
数组存储轨迹202 MB48 kB
字符串存储轨迹346 MB48 kB

2、性能对比

轨迹存储方式轨迹查询QPS
传统存储(每个点一条记录,索引不包含位置)133
传统存储(每个点一条记录,索引包含位置)946
数组存储轨迹543
字符串存储轨迹2056

知识点

array

include index

window function

思考

轨迹分析应用: 轨迹碰撞, 例如刑侦场景

如果按用户的时间范围查询轨迹, 使用什么数据类型、索引性能更好? range type + gist index ?

如何按用户的空间范围查询轨迹, 使用什么数据类型、索引性能更好? range type + gist index ?

如果一次不想取出所有的点, 如何从字符串存储的轨迹点中获取某个时间对应的点, 或者某一段点?

参考

202105/20210530_02.md 《重新发现PostgreSQL之美 - 8 轨迹业务IO杀手克星index include(覆盖索引)》
202104/20210406_04.md 《PostgreSQL 14 preview - SP-GiST 索引新增 index 叶子结点 include column value 功能 支持》
202004/20200429_01.md 《PostgreSQL 索引算子下推扩展 - 索引内offset - 索引内过滤 - include index - 随机偏移》
201905/20190503_03.md 《PostgreSQL index include - 类聚簇表与应用(append only, IoT时空轨迹, 离散多行扫描与返回)》
201903/20190331_08.md 《PostgreSQL 12 preview - GiST 索引支持INCLUDE columns - 覆盖索引 - 类聚簇索引》
201812/20181209_01.md 《PostgreSQL IoT,车联网 - 实时轨迹、行程实践 2 - (含index only scan类聚簇表效果)》
201812/20181207_01.md 《PostgreSQL IoT,车联网 - 实时轨迹、行程实践 1》
201707/20170722_02.md 《车联网案例,轨迹清洗 - 阿里云RDS PostgreSQL最佳实践 - 窗口函数》
202212/20221229_01.md 《PolarDB 开源版 轨迹应用实践 - 出行、配送、快递等业务的调度; 传染溯源; 刑侦》
202212/20221223_02.md 《使用 PolarDB 开源版 部署 PostGIS 支撑时空轨迹|地理信息|路由等业务》
202106/20210602_01.md 《重新发现PostgreSQL之美 - 11 时空轨迹系统 新冠&刑侦&预测》
202105/20210530_02.md 《重新发现PostgreSQL之美 - 8 轨迹业务IO杀手克星index include(覆盖索引)》
202105/20210509_01.md 《PostgreSQL 应用开发解决方案最佳实践系列课程 - 6. 时空、时态、时序、日志等轨迹系统》
202011/20201117_01.md 《使用Postgres,MobilityDB和Citus大规模(百亿级)实时分析GPS轨迹》
201905/20190503_03.md 《PostgreSQL index include - 类聚簇表与应用(append only, IoT时空轨迹, 离散多行扫描与返回)》
201812/20181209_01.md 《PostgreSQL IoT,车联网 - 实时轨迹、行程实践 2 - (含index only scan类聚簇表效果)》
201812/20181207_01.md 《PostgreSQL IoT,车联网 - 实时轨迹、行程实践 1》
201811/20181101_02.md 《PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合》
201806/20180607_02.md 《Greenplum 轨迹相似(伴随分析)》
201712/20171231_01.md 《PostgreSQL 实时位置跟踪+轨迹分析系统实践 - 单机顶千亿轨迹/天》
201712/20171204_01.md 《GIS术语 - POI、AOI、LOI、路径、轨迹》
201708/20170803_01.md 《菜鸟末端轨迹 - 电子围栏(解密支撑每天251亿个包裹的数据库) - 阿里云RDS PostgreSQL最佳实践》
201707/20170722_02.md 《车联网案例,轨迹清洗 - 阿里云RDS PostgreSQL最佳实践 - 窗口函数》
201704/20170418_01.md 《PostgreSQL 物流轨迹系统数据库需求分析与设计 - 包裹侠实时跟踪与召回》
201702/20170219_01.md 《PostgreSQL 聚集存储 与 BRIN索引 - 高并发行为、轨迹类大吞吐数据查询场景解说》
201606/20160611_02.md 《PostgreSQL 如何轻松搞定行驶、运动轨迹合并和切分》
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论