PostgreSQL 随机查询采样 - 既要真随机、又要高性能 - table sample方法

digoal 2021-01-05

4214

作者

digoal

日期

2021-05-27

背景

满足条件的结果有N条, 随机返回M条, 什么场景的需求?
推荐?

1、BERNOULLI , 逐条扫描全表, 采样返回总记录数的N%.
如果采用 limit N 截取N条返回, 效率会提高, 但是你会发现数据不够随机, 基本上集中在数据文件头部的blocks.

```
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(1,6) | 76 | 2006 | 100034889 | 100035096
(3,8) | 218 | 3280 | 100105401 | 100105571
(4,19) | 299 | 708 | 100145769 | 100146449
(6,25) | 445 | 3195 | 100220431 | 100221192
(7,3) | 493 | 1867 | 100247252 | 100248048
(9,5) | 635 | 2125 | 100318350 | 100319087
(10,51) | 751 | 1151 | 100374936 | 100375883
(12,20) | 860 | 2302 | 100430532 | 100430674
(12,33) | 873 | 15 | 100438908 | 100439548
(17,15) | 1205 | 2540 | 100607913 | 100608198
(10 rows)

Time: 0.387 ms
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+-----+------+-----------+-----------
(1,18) | 88 | 1195 | 100038559 | 100039159
(2,20) | 160 | 4137 | 100076107 | 100076919
(4,62) | 342 | 2480 | 100169826 | 100170454
(6,2) | 422 | 392 | 100209331 | 100209964
(6,4) | 424 | 4239 | 100210066 | 100211053
(8,52) | 612 | 3086 | 100304147 | 100304853
(10,9) | 709 | 2137 | 100354108 | 100354314
(10,45) | 745 | 1808 | 100371819 | 100372439
(13,19) | 929 | 3888 | 100471124 | 100471669
(13,27) | 937 | 609 | 100476280 | 100476579
(10 rows)

Time: 0.301 ms
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(0,38) | 38 | 486 | 100015564 | 100015931
(0,51) | 51 | 1244 | 100022368 | 100022860
(1,29) | 99 | 4326 | 100044116 | 100044999
(1,63) | 133 | 449 | 100061611 | 100062197
(4,10) | 290 | 755 | 100141357 | 100142271
(5,13) | 363 | 960 | 100178662 | 100179447
(5,64) | 414 | 3800 | 100204948 | 100205770
(11,22) | 792 | 831 | 100395860 | 100396113
(14,17) | 997 | 2487 | 100504353 | 100504505
(16,41) | 1161 | 4053 | 100587279 | 100587984
(10 rows)

Time: 0.507 ms
```

2、SYSTEM , 逐BLOCKS扫描全表, 采样返回block总数的N%. 效率比BERNOULLI高很多, 但是每个被采样的block里面的所有记录都会被返回.
如果采用 limit N 截取N条返回, 你会发现数据不够随机, 每个被采样的block里面的所有记录都会被返回.

```
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+-----+------+-----------+-----------
(10,1) | 701 | 2675 | 100348960 | 100349937
(10,2) | 702 | 4307 | 100349937 | 100350353
(10,3) | 703 | 475 | 100350353 | 100351093
(10,4) | 704 | 1611 | 100351093 | 100351171
(10,5) | 705 | 4307 | 100351171 | 100351692
(10,6) | 706 | 2841 | 100351692 | 100352448
(10,7) | 707 | 3680 | 100352448 | 100353372
(10,8) | 708 | 1085 | 100353372 | 100354108
(10,9) | 709 | 2137 | 100354108 | 100354314
(10,10) | 710 | 3381 | 100354314 | 100354905
(10 rows)

Time: 0.547 ms

postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(30,1) | 2101 | 2279 | 101047339 | 101047873
(30,2) | 2102 | 4113 | 101047873 | 101048273
(30,3) | 2103 | 47 | 101048273 | 101048691
(30,4) | 2104 | 3766 | 101048691 | 101049398
(30,5) | 2105 | 4133 | 101049398 | 101049842
(30,6) | 2106 | 2915 | 101049842 | 101050766
(30,7) | 2107 | 2489 | 101050766 | 101050799
(30,8) | 2108 | 4325 | 101050799 | 101051401
(30,9) | 2109 | 2204 | 101051401 | 101052374
(30,10) | 2110 | 1397 | 101052374 | 101052726
(10 rows)

Time: 0.457 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
--------+----+------+-----------+-----------
(0,1) | 1 | 234 | 100000000 | 100000295
(0,2) | 2 | 1153 | 100000295 | 100000793
(0,3) | 3 | 4127 | 100000793 | 100001203
(0,4) | 4 | 3287 | 100001203 | 100001450
(0,5) | 5 | 2156 | 100001450 | 100002286
(0,6) | 6 | 1097 | 100002286 | 100002351
(0,7) | 7 | 2302 | 100002351 | 100002625
(0,8) | 8 | 1768 | 100002625 | 100003382
(0,9) | 9 | 39 | 100003382 | 100003630
(0,10) | 10 | 887 | 100003630 | 100004209
(10 rows)

Time: 0.427 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(43,1) | 3011 | 4174 | 101493150 | 101493628
(43,2) | 3012 | 798 | 101493628 | 101494207
(43,3) | 3013 | 1334 | 101494207 | 101495189
(43,4) | 3014 | 4870 | 101495189 | 101495937
(43,5) | 3015 | 1920 | 101495937 | 101496065
(43,6) | 3016 | 2561 | 101496065 | 101496782
(43,7) | 3017 | 1617 | 101496782 | 101497469
(43,8) | 3018 | 3226 | 101497469 | 101497864
(43,9) | 3019 | 4582 | 101497864 | 101498854
(43,10) | 3020 | 2110 | 101498854 | 101499388
(10 rows)

Time: 1.528 ms
```

3、SYSTEM + random 概率过滤 , 解决system返回的记录不够离散的问题.
性能也能接受, 越大的表, 性能越是突出.
离散度取决于random概率条件, 返回的结果就非常离散了

```
postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(3,19) | 229 | 2106 | 100110347 | 100110461
(3,45) | 255 | 4334 | 100124594 | 100125503
(149,43) | 10473 | 4125 | 105216168 | 105216240
(160,12) | 11212 | 787 | 105591453 | 105591718
(160,36) | 11236 | 106 | 105600343 | 105600824
(160,47) | 11247 | 4053 | 105606511 | 105607286
(203,48) | 14258 | 3078 | 107103358 | 107104264
(278,29) | 19489 | 3268 | 109723508 | 109723920
(286,10) | 20030 | 1048 | 109999654 | 110000489
(315,31) | 22081 | 2826 | 111013999 | 111014368
(10 rows)

Time: 2.837 ms

postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(69,26) | 4856 | 7 | 102417440 | 102417597
(71,65) | 5035 | 1092 | 102510582 | 102511527
(77,39) | 5429 | 1120 | 102704924 | 102705238
(152,31) | 10671 | 1466 | 105323336 | 105324170
(176,27) | 12347 | 4757 | 106150494 | 106151243
(176,59) | 12379 | 1195 | 106167151 | 106167416
(206,70) | 14490 | 3052 | 107221746 | 107222087
(229,46) | 16076 | 463 | 107995891 | 107996698
(242,29) | 16969 | 1865 | 108461627 | 108461736
(242,62) | 17002 | 4863 | 108477461 | 108478270
(10 rows)

Time: 2.667 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(1,54) | 124 | 1346 | 100056875 | 100057832
(9,69) | 699 | 2227 | 100347977 | 100348861
(14,54) | 1034 | 3879 | 100520998 | 100521176
(22,60) | 1600 | 2226 | 100802698 | 100803154
(28,30) | 1990 | 1133 | 100996297 | 100996580
(28,57) | 2017 | 189 | 101009211 | 101009952
(28,69) | 2029 | 2109 | 101015988 | 101016238
(106,1) | 7421 | 2469 | 103702081 | 103702435
(130,67) | 9167 | 1801 | 104582977 | 104583754
(149,20) | 10450 | 4126 | 105204544 | 105204560
(10 rows)

Time: 3.157 ms

postgres=# explain select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
QUERY PLAN

Limit (cost=0.00..0.92 rows=10 width=90)
-> Sample Scan on tbl (cost=0.00..1535.40 rows=16667 width=90)
Sampling: system ('5'::real)
Filter: (random() < '0.01'::double precision)
(4 rows)

Time: 0.858 ms
```

4、除了内置的采样方法, PG还支持扩展采样方法, 例如按时间采样, 随机扫描并且只扫描指定的时间, 例如最多扫描1毫秒, 返回这1毫秒扫描到的记录.

同时还支持按行数返回, 例如最多采样100行.

《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》

```
-- 最多5毫秒
postgres=# select ctid,* from tbl TABLESAMPLE system_time (5) where random()<0.02 limit 10;
ctid | id | loc | beginid | endid
------------+--------+------+-----------+-----------
(13943,32) | 976042 | 786 | 587732405 | 587732831
(5014,17) | 350997 | 4515 | 275520535 | 275520831
(10371,58) | 726028 | 3201 | 462659708 | 462660690
(10371,67) | 726037 | 2160 | 462664042 | 462664194
(1442,8) | 100948 | 1134 | 150431878 | 150432150
(1442,22) | 100962 | 2839 | 150436680 | 150437122
(1442,36) | 100976 | 3027 | 150444363 | 150444375
(6799,50) | 475980 | 4172 | 337780753 | 337781376
(6799,51) | 475981 | 1971 | 337781376 | 337781830
(12156,31) | 850951 | 3287 | 525181282 | 525182019
(10 rows)

Time: 0.908 ms

postgres=# explain select ctid,* from tbl TABLESAMPLE system_time (5) where random()<0.02 limit 10;
QUERY PLAN

Limit (cost=0.00..0.92 rows=10 width=90)
-> Sample Scan on tbl (cost=0.00..10.75 rows=117 width=90)
Sampling: system_time ('5'::double precision)
Filter: (random() < '0.02'::double precision)
(4 rows)

Time: 0.697 ms
```

```
- 最多1000行
-- 1000概率0.01 = 10 所以limit 10可以忽略
postgres=# select ctid, from tbl TABLESAMPLE system_rows(1000) where random()<0.01 ;
ctid | id | loc | beginid | endid
-----------+--------+------+-----------+-----------
(8522,48) | 596588 | 2148 | 397979059 | 397980022
(8454,21) | 591801 | 670 | 395580178 | 395581156
(8454,70) | 591850 | 1441 | 395608906 | 395609174
(1209,13) | 84643 | 3924 | 142242550 | 142243369
(8318,36) | 582296 | 4625 | 390811869 | 390811932
(1141,5) | 79875 | 2093 | 139858769 | 139859220
(1141,70) | 79940 | 4611 | 139892888 | 139893837
(1005,14) | 70364 | 2328 | 135109401 | 135109605
(1005,23) | 70373 | 2382 | 135113320 | 135113475
(9 rows)

Time: 2.370 ms
```

对比性能

```
postgres=# select count(*) from tbl;
count

1000000
(1 row)
```

100万条记录的随机扫描对比

```
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 12 -j 12 -T 10
progress: 1.0 s, 11.9 tps, lat 685.136 ms stddev 65.024
progress: 2.0 s, 13.1 tps, lat 669.947 ms stddev 18.605
progress: 3.0 s, 22.9 tps, lat 708.591 ms stddev 43.357
progress: 4.0 s, 13.0 tps, lat 641.470 ms stddev 24.108
progress: 5.0 s, 22.9 tps, lat 676.400 ms stddev 37.549
progress: 6.0 s, 13.0 tps, lat 689.115 ms stddev 17.644
progress: 7.0 s, 18.9 tps, lat 767.281 ms stddev 22.536
progress: 8.0 s, 15.0 tps, lat 819.248 ms stddev 39.973
progress: 9.0 s, 13.0 tps, lat 856.244 ms stddev 29.988
progress: 10.0 s, 12.0 tps, lat 829.066 ms stddev 22.431
pgbench (PostgreSQL) 14.0
transaction type: ./t1.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 168
latency average = 749.340 ms
latency stddev = 101.206 ms
initial connection time = 11.026 ms
tps = 15.804188 (without initial connection time)
statement latencies in milliseconds:
749.340 select * from tbl order by random() limit 10;

IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 12 -j 12 -T 10
progress: 1.0 s, 32119.4 tps, lat 0.367 ms stddev 0.932
progress: 2.0 s, 31335.6 tps, lat 0.384 ms stddev 1.286
progress: 3.0 s, 31556.0 tps, lat 0.376 ms stddev 1.679
progress: 4.0 s, 30996.3 tps, lat 0.390 ms stddev 1.413
progress: 5.0 s, 30622.0 tps, lat 0.390 ms stddev 1.006
progress: 6.0 s, 30778.7 tps, lat 0.384 ms stddev 2.160
progress: 7.0 s, 29766.9 tps, lat 0.405 ms stddev 2.657
progress: 8.0 s, 27425.5 tps, lat 0.443 ms stddev 2.396
progress: 9.0 s, 25860.8 tps, lat 0.462 ms stddev 3.315
progress: 10.0 s, 25695.2 tps, lat 0.462 ms stddev 2.708
pgbench (PostgreSQL) 14.0
transaction type: ./t2.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 296261
latency average = 0.404 ms
latency stddev = 2.059 ms
initial connection time = 13.047 ms
tps = 29641.685598 (without initial connection time)
statement latencies in milliseconds:
0.406 select ctid,* from tbl TABLESAMPLE SYSTEM (5) where random()<0.01 limit 10;

IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t3.sql -c 12 -j 12 -T 10
progress: 1.0 s, 28849.9 tps, lat 0.407 ms stddev 1.055
progress: 2.0 s, 28843.1 tps, lat 0.418 ms stddev 1.950
progress: 3.0 s, 27394.0 tps, lat 0.433 ms stddev 2.971
progress: 4.0 s, 28447.4 tps, lat 0.422 ms stddev 2.686
progress: 5.0 s, 26309.0 tps, lat 0.458 ms stddev 2.987
progress: 6.0 s, 29036.4 tps, lat 0.413 ms stddev 0.963
progress: 7.0 s, 29986.4 tps, lat 0.402 ms stddev 1.382
progress: 8.0 s, 29973.1 tps, lat 0.400 ms stddev 1.649
progress: 9.0 s, 25595.4 tps, lat 0.468 ms stddev 2.718
progress: 10.0 s, 26574.2 tps, lat 0.451 ms stddev 3.553
pgbench (PostgreSQL) 14.0
transaction type: ./t3.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 281140
latency average = 0.426 ms
latency stddev = 2.321 ms
initial connection time = 15.631 ms
tps = 28142.235104 (without initial connection time)
statement latencies in milliseconds:
0.429 select ctid,* from tbl TABLESAMPLE SYSTEM_ROWS (1000) where random()<0.01;

IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t4.sql -c 12 -j 12 -T 10
progress: 1.0 s, 26988.7 tps, lat 0.439 ms stddev 1.327
progress: 2.0 s, 24800.7 tps, lat 0.472 ms stddev 2.408
progress: 3.0 s, 25742.0 tps, lat 0.470 ms stddev 3.191
progress: 4.0 s, 29388.5 tps, lat 0.414 ms stddev 2.189
progress: 5.0 s, 29199.8 tps, lat 0.404 ms stddev 0.590
progress: 6.0 s, 25906.5 tps, lat 0.470 ms stddev 3.301
progress: 7.0 s, 29297.8 tps, lat 0.409 ms stddev 1.143
progress: 8.0 s, 29572.6 tps, lat 0.405 ms stddev 0.805
progress: 9.0 s, 26367.6 tps, lat 0.443 ms stddev 2.976
progress: 10.0 s, 25389.1 tps, lat 0.475 ms stddev 3.652
pgbench (PostgreSQL) 14.0
transaction type: ./t4.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 272779
latency average = 0.439 ms
latency stddev = 2.379 ms
initial connection time = 13.144 ms
tps = 27255.103831 (without initial connection time)
statement latencies in milliseconds:
0.441 select ctid,* from tbl TABLESAMPLE SYSTEM_time (5) where random()<0.01 limit 10;
```

有索引过滤条件, 并且满足条件的记录占比很少, 怎么办?

1000万行, 满足条件1万行. 用随机采样可能要扫描很多个block才能拿到精确的数据.

方法1, 传统的order by random().

```
create table t (id int primary key, c1 int, info text, crt_time timestamp);
insert into t select generate_series(1,10000000), random()*999, md5(random()::text), clock_timestamp();
create index idx_t_1 on t (c1);

postgres=# explain select * from t where c1=0 limit 10;
QUERY PLAN

Limit (cost=0.43..11.11 rows=10 width=49)
-> Index Scan using idx_t_1 on t (cost=0.43..10623.39 rows=9947 width=49)
Index Cond: (c1 = 0)
(3 rows)

postgres=# explain select * from t where c1=0 order by random() limit 10;
QUERY PLAN

Limit (cost=10863.21..10863.23 rows=10 width=57)
-> Sort (cost=10863.21..10888.07 rows=9947 width=57)
Sort Key: (random())
-> Index Scan using idx_t_1 on t (cost=0.43..10648.25 rows=9947 width=57)
Index Cond: (c1 = 0)
(5 rows)

IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 12 -j 12 -T 10
progress: 1.0 s, 1165.9 tps, lat 9.960 ms stddev 14.434
progress: 2.0 s, 1298.1 tps, lat 9.319 ms stddev 12.472
progress: 3.0 s, 1313.8 tps, lat 9.137 ms stddev 11.276
progress: 4.0 s, 1161.6 tps, lat 10.068 ms stddev 13.447
progress: 5.0 s, 972.2 tps, lat 12.399 ms stddev 20.388
progress: 6.0 s, 999.8 tps, lat 12.249 ms stddev 17.811
progress: 7.0 s, 1153.3 tps, lat 10.366 ms stddev 13.479
progress: 8.0 s, 1180.6 tps, lat 9.992 ms stddev 12.386
progress: 9.0 s, 1029.6 tps, lat 11.916 ms stddev 17.798
progress: 10.0 s, 966.5 tps, lat 12.252 ms stddev 17.236
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 11257
latency average = 10.658 ms
latency stddev = 15.150 ms
initial connection time = 13.008 ms
tps = 1124.835901 (without initial connection time)
statement latencies in milliseconds:
11.003 select * from t where c1=0 order by random() limit 10;
```

方法2, 用索引得到满足条件的blocks(这一步最耗时), 从这些blocks中得到随机的10个block, 从这10个blocks的所有记录中随机返回10条.

```
select t. from t where exists
( select 1 from
(select blkid from (select substring(ctid::text,'(\d),') blkid from t where c1=0 group by 1) t -- 获取所有满足条件的BLOCK ID
order by random() limit 10) t1 -- 获取10个随机block id
where t.ctid>=('('||t1.blkid||',0)')::tid -- 获取这10个block的所有记录
and t.ctid< ('('||t1.blkid::int+1||',0)')::tid -- 获取这10个block的所有记录
and t.c1=0 ) -- 过滤出c1=0的
order by random()
limit 10; -- 返回10条

                                                                                QUERY PLAN

Limit (cost=26432.50..26432.52 rows=10 width=57)
-> Sort (cost=26432.50..26435.26 rows=1106 width=57)
Sort Key: (random())
-> Nested Loop Semi Join (cost=11246.98..26408.60 rows=1106 width=57)
Join Filter: ((t.ctid >= ((('('::text || t1.blkid) || ',0)'::text))::tid) AND (t.ctid < ((('('::text || (((t1.blkid)::integer + 1))::text) || ',0)'::text))::tid))
-> Index Scan using idx_t_1 on t (cost=0.43..10633.09 rows=9957 width=55)
Index Cond: (c1 = 0)
-> Materialize (cost=11246.54..11246.72 rows=10 width=32)
-> Subquery Scan on t1 (cost=11246.54..11246.67 rows=10 width=32)
-> Limit (cost=11246.54..11246.57 rows=10 width=40)
-> Sort (cost=11246.54..11271.43 rows=9957 width=40)
Sort Key: (random())
-> Subquery Scan on t_1 (cost=10732.66..11031.37 rows=9957 width=40)
-> HashAggregate (cost=10732.66..10906.91 rows=9957 width=32)
Group Key: "substring"((t_2.ctid)::text, '(\d*),'::text)
-> Index Scan using idx_t_1 on t t_2 (cost=0.43..10707.77 rows=9957 width=32)
Index Cond: (c1 = 0)
(17 rows)

Time: 0.707 ms

IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 12 -j 12 -T 10
progress: 1.0 s, 113.9 tps, lat 99.518 ms stddev 20.071
progress: 2.0 s, 125.1 tps, lat 95.576 ms stddev 10.811
progress: 3.0 s, 117.7 tps, lat 101.022 ms stddev 14.942
progress: 4.0 s, 95.2 tps, lat 124.592 ms stddev 26.282
progress: 5.0 s, 93.8 tps, lat 128.335 ms stddev 15.483
progress: 6.0 s, 95.9 tps, lat 128.530 ms stddev 14.861
progress: 7.0 s, 101.4 tps, lat 115.433 ms stddev 15.376
progress: 8.0 s, 116.0 tps, lat 104.154 ms stddev 10.846
progress: 9.0 s, 112.7 tps, lat 105.956 ms stddev 11.932
progress: 10.0 s, 110.3 tps, lat 107.870 ms stddev 11.686
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 1095
latency average = 109.931 ms
latency stddev = 19.405 ms
initial connection time = 10.773 ms
tps = 108.826246 (without initial connection time)
statement latencies in milliseconds:
109.965 select t.* from t where exists
```

或者我们可以在业务层定期查出, 放到换成中业务自己去随机返回给客户, 又或者直接放在数据库中作为临时表,定期刷新内容即可

``` postgres=# create unlogged table tmp_t_c1_1 (like t); CREATE TABLE Time: 6.524 ms postgres=# insert into tmp_t_c1_1 select * from t where c1=1; INSERT 0 10001 Time: 51.261 ms postgres=# select * from tmp_t_c1_1 tables

postgres=# select * from tmp_t_c1_1 tablesample system_rows(1000) where random()<0.01 ; id | c1 | info | crt_time
---------+----+----------------------------------+---------------------------- 7869277 | 1 | b4483c7728566a238de08f77eba12774 | 2021-05-27 16:19:32.566097 7881151 | 1 | 49af1161c2e3a779328b32ddd058c7f0 | 2021-05-27 16:19:32.588799 1226520 | 1 | 46981759668b3a9acb953ab8b279f4ea | 2021-05-27 16:19:16.06893 1244566 | 1 | db248939b6c6647b25b1a32584747be8 | 2021-05-27 16:19:16.10505 1251111 | 1 | 9096bc3b83f80a20342b47c483c66b3b | 2021-05-27 16:19:16.118112 4677752 | 1 | 97dbe82b4f6476ee50d6a37775fb3e6f | 2021-05-27 16:19:24.337757 7984355 | 1 | 4f996ed3be9fb0fbd26b0444f44681d6 | 2021-05-27 16:19:32.817659 7996207 | 1 | bd4d80164abf2b17153e6b10fc871725 | 2021-05-27 16:19:32.840801 8020673 | 1 | 95765bac6bf7952bc59a6bc8d5db24f8 | 2021-05-27 16:19:32.887891 1322032 | 1 | 0f769db9cb6c5a920dba9ffd364b0a7d | 2021-05-27 16:19:16.285266 4734787 | 1 | 75f8b9e9ce72e7f5733efeaf0ed40215 | 2021-05-27 16:19:24.449741 8128961 | 1 | c356f3d289459dbe099769c8bd786e2c | 2021-05-27 16:19:33.124422 1492978 | 1 | 0619d948880cf16f7ce0c8e405525621 | 2021-05-27 16:19:16.678358 (13 rows)

Time: 4.226 ms postgres=# select * from tmp_t_c1_1 tablesample system_rows(1000) where random()<0.01 ; id | c1 | info | crt_time
---------+----+----------------------------------+---------------------------- 3659031 | 1 | edb2c2ee58aa530fbe3ec8199c31914c | 2021-05-27 16:19:21.990993 3705687 | 1 | 47dd5dce35ffdf84792a5c64f573f5cb | 2021-05-27 16:19:22.086795 2578458 | 1 | bc97c41db6329221ae30fe604eb222a2 | 2021-05-27 16:19:19.382254 1518045 | 1 | 78f90fa17050b2eacd3222c9948606dd | 2021-05-27 16:19:16.727858 1553043 | 1 | 48d2363be6446e266d074c6e98a856d5 | 2021-05-27 16:19:16.796445 1587461 | 1 | 9329578a7b41b36b4919e4c25b9e1c31 | 2021-05-27 16:19:16.890558 410551 | 1 | 6184732511cfad0b05c75c1865a9e920 | 2021-05-27 16:19:13.974389 9438915 | 1 | cb0fc9af4da9259c3f32ec37180a6774 | 2021-05-27 16:19:36.463244 9501777 | 1 | b82cb22ba0e7620b21cb6e9219d6f425 | 2021-05-27 16:19:36.620395 7345643 | 1 | 9820565758529cfb752135cc5e523adf | 2021-05-27 16:19:31.304132 6273533 | 1 | a6cdf268caeeeeda3d766d2370e84655 | 2021-05-27 16:19:28.488774 6323667 | 1 | b000452c87b882ff00e4aff2abe34cc9 | 2021-05-27 16:19:28.613903 6346355 | 1 | 6810c56b96bdcaf2337ce23797b9d118 | 2021-05-27 16:19:28.669933 5240569 | 1 | 2e33eaf8ea4e12cf8e3bcae6022c48c3 | 2021-05-27 16:19:25.660275 5243275 | 1 | 6bcdd1dea42daf07302310c0edbfdaa7 | 2021-05-27 16:19:25.666971 4184755 | 1 | 8c04c876d1fe13b7754ceb8eb4782969 | 2021-05-27 16:19:23.177027 (16 rows)

Time: 0.732 ms ```

思考

关于过滤性好的随机获取N条的问题?

通过修改内核索引结构也许有更好的效果, 解决必须通过索引链表顺序扫描全部满足条件数据后再随机的问题.

参考

202005/20200509_01.md 《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》

201906/20190613_01.md 《PostgreSQL 一条简单、“有人情味”的抽随机奖SQL》

201810/20181009_01.md 《PostgreSQL 随机记录返回 - 300倍提速实践 (随机数组下标代替order by random())》

201808/20180810_01.md 《PostgreSQL 内容随机推荐系统开发实践 - 文章随机推荐》

201706/20170602_02.md 《PostgreSQL Oracle 兼容性之 - 数据采样与脱敏》

201609/20160929_01.md 《PostgreSQL 巧妙的数据采样方法》

201505/20150525_01.md 《PostgreSQL 9.5 new feature - table | mview data sample》

201102/20110212_01.md 《PostgreSQL 随机查询优化》

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL 随机查询采样 - 既要真随机、又要高性能 - table sample方法

作者

日期

标签

背景

对比性能

有索引过滤条件, 并且满足条件的记录占比很少, 怎么办?

思考

参考

202005/20200509_01.md 《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》

201906/20190613_01.md 《PostgreSQL 一条简单、“有人情味”的抽随机奖SQL》

201810/20181009_01.md 《PostgreSQL 随机记录返回 - 300倍提速实践 (随机数组下标代替order by random())》

201808/20180810_01.md 《PostgreSQL 内容随机推荐系统开发实践 - 文章随机推荐》

201706/20170602_02.md 《PostgreSQL Oracle 兼容性之 - 数据采样与脱敏》

201609/20160929_01.md 《PostgreSQL 巧妙的数据采样方法》

201505/20150525_01.md 《PostgreSQL 9.5 new feature - table | mview data sample》

201102/20110212_01.md 《PostgreSQL 随机查询优化》

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

PostgreSQL 随机查询采样 - 既要真随机、又要高性能 - table sample方法

作者

日期

标签

背景

对比性能

有索引过滤条件, 并且满足条件的记录占比很少, 怎么办?

思考

参考

202005/20200509_01.md 《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》

201906/20190613_01.md 《PostgreSQL 一条简单、“有人情味”的抽随机奖SQL》

201810/20181009_01.md 《PostgreSQL 随机记录返回 - 300倍提速实践 (随机数组下标代替order by random())》

201808/20180810_01.md 《PostgreSQL 内容随机推荐系统开发实践 - 文章随机推荐》

201706/20170602_02.md 《PostgreSQL Oracle 兼容性 之 - 数据采样与脱敏》

201609/20160929_01.md 《PostgreSQL 巧妙的数据采样方法》

201505/20150525_01.md 《PostgreSQL 9.5 new feature - table | mview data sample》

201102/20110212_01.md 《PostgreSQL 随机查询优化》

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

201706/20170602_02.md 《PostgreSQL Oracle 兼容性之 - 数据采样与脱敏》