作者
digoal
日期
2021-05-19
标签
PostgreSQL , random_zipfian , pgbench , 长尾模型 , 离散幂律概率分布
背景
https://blog.csdn.net/u010945683/article/details/78843655
最近在看Multi-View Learning of Word Embeddings via CCA这篇文章时,里面提到了一个陌生的概率分布模型.
Zipfian distribution
一个离散幂律概率分布,也就是常常提到的长尾模型.在自然语言的语料库里,一个单词出现的频率与它在频率表里的排名成反比。频率最高的单词出现的频率大约是出现频率第二位的单词的2倍,是出现频率第三位的单词的3倍。
在Brown语料库中出现次数最多的词”the”占语料库中所有词出现次数的7%,排名第二的是”of”占3.5%.
pgbench 支持zipfian生成算法: 用于长尾模型.
https://www.postgresql.org/docs/devel/pgbench.html
```
公平随机分布
random ( lb, ub ) → integer
Computes a uniformly-distributed random integer in [lb, ub].
random(1, 10) → an integer between 1 and 10
指数分布
random_exponential ( lb, ub, parameter ) → integer
Computes an exponentially-distributed random integer in [lb, ub], see below.
random_exponential(1, 10, 3.0) → an integer between 1 and 10
高斯分布
random_gaussian ( lb, ub, parameter ) → integer
Computes a Gaussian-distributed random integer in [lb, ub], see below.
random_gaussian(1, 10, 2.5) → an integer between 1 and 10
离散幂律概率分布
random_zipfian ( lb, ub, parameter ) → integer
Computes a Zipfian-distributed random integer in [lb, ub], see below.
random_zipfian(1, 10, 1.5) → an integer between 1 and 10
```
```
random_zipfian generates a bounded Zipfian distribution.
parameter defines how skewed the distribution is.
The larger the parameter, the more frequently values closer to the beginning of the interval are drawn.
The distribution is such that, assuming the range starts from 1,
the ratio of the probability of drawing k versus drawing k+1 is
((k+1)/k)**parameter
For example,
random_zipfian(1, ..., 2.5) produces the value:
1 about (2/1)2.5 = 5.66 times more frequently than 2,
which 2 is produced (3/2)2.5 = 2.76 times more frequently than 3,
and so on.
```
1出现的概率是2出现概率的5.66倍
2出现的概率是3出现概率的2.76倍
pgbench压测例子
``` vi t.sql \set size 10000000 \set r random_zipfian(1, :size, 2) \set k permute(:r, :size) \startpipeline insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); \endpipeline
pgbench -M prepared -n -r -P 1 -f ./t.sql -c 8 -j 8 -T 120
postgres=# create table tbl(c1 int, c2 int);
postgres=# select count(distinct c1), count(distinct c2) from tbl; count | count -------+------- 1096 | 2097 (1 row)
c1被映射到随机位置
postgres=# select c1,c2,count(*) from tbl group by 1,2 order by 1;
c1 | c2 | count
--------+---------+--------
1 | 896298 | 39019
1 | 5713084 | 67095
1 | 6611514 | 89746
1 | 8536650 | 206740
2 | 521883 | 51963
2 | 4019018 | 16880
2 | 8021937 | 22673
2 | 9983772 | 9706
3 | 6304463 | 22955
3 | 7681121 | 10035
3 | 7921875 | 4412
3 | 8685496 | 7477
4 | 6661420 | 13056
4 | 7145966 | 2393
4 | 7175403 | 4168
4 | 7660974 | 5634
... ...
47915 | 1488672 | 1
48927 | 2723112 | 1
77704 | 7517990 | 1
80241 | 8421942 | 1
90213 | 1020744 | 1
584837 | 1755648 | 1
975187 | 763529 | 1
(2097 rows)
```
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.
9.9元购买3个月阿里云RDS PostgreSQL实例
PostgreSQL 解决方案集合
德哥 / digoal's github - 公益是一辈子的事.





