PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time

digoal 2020-05-09

922

作者

digoal

日期

2020-05-09

背景

随机采样, 用于在资源池中随机提取一些记录, 返回给不同的客户端.

内置采样方法system, 如下, 最多访问随机10%的数据块, 找到符合条件的记录, 满10条limit即停止扫描. 也就是说最多会扫描10%的video table数据块. 即使没有符合条件的10条结果, 也返回.

```
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system(10) where play_count>=2000 limit 10;
QUERY PLAN

Limit (cost=0.00..1.73 rows=10 width=12) (actual time=0.006..0.011 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=1
-> Sample Scan on public.video v1 (cost=0.00..3856.95 rows=22351 width=12) (actual time=0.006..0.009 rows=10 loops=1)
Output: id, play_count
Sampling: system ('10'::real)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=1
Planning Time: 0.151 ms
Execution Time: 0.024 ms
(10 rows)
```

有没有更精细化的采样阈值控制呢, 以上如果调用频率很高, 并且采样的比例很高, 并且符合条件的记录很少, 那么就可能耗费较多资源, 导致雪崩.

如何优化? 如下:

块级别随机采样method扩展插件

```
create extension tsm_system_rows ;
CREATE EXTENSION

create extension tsm_system_time ;
CREATE EXTENSION
```

最多采样10毫秒, 返回符合play_count>=2000的10条. (如果很快就有10条符合条件, 那么不会继续扫描, 所以很快很快)

```
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system_time(10) where play_count>=2000 limit 10;
QUERY PLAN

Limit (cost=0.00..1.73 rows=10 width=12) (actual time=0.009..0.030 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=3
-> Sample Scan on public.video v1 (cost=0.00..10.71 rows=62 width=12) (actual time=0.008..0.028 rows=10 loops=1)
Output: id, play_count
Sampling: system_time ('10'::double precision)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=3
Planning Time: 0.052 ms
Execution Time: 0.042 ms
(10 rows)
```

最多采样100条, 返回符合play_count>=2000的10条. (如果很快就有10条符合条件, 那么不会继续扫描, 所以很快很快)

```
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system_rows (100) where play_count>=2000 limit 10;
QUERY PLAN

Limit (cost=0.00..1.75 rows=10 width=12) (actual time=0.023..0.031 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=1
-> Sample Scan on public.video v1 (cost=0.00..16.65 rows=95 width=12) (actual time=0.022..0.028 rows=10 loops=1)
Output: id, play_count
Sampling: system_rows ('100'::bigint)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=1
Planning Time: 0.076 ms
Execution Time: 0.051 ms
(10 rows)
```

在时间可控,代价可控的情况下(防止采样雪崩),保证采样随机性.
如果where条件的记录占比很少很少, 可能达到采样上限后无法返回limit的条数. 这种情况下, 可以选择几种方法:
1、调大采样上限, 注意防止雪崩
2、修改where条件, 使得覆盖率变大.
3、垃圾回收, 使得每个block的空洞变少.
4、分区, 提高目标采样表中 where 条件符合条件记录的占比.

参考

https://www.postgresql.org/docs/12/tsm-system-rows.html

https://www.postgresql.org/docs/12/tsm-system-time.html

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time

作者

日期

标签

背景

参考

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论