暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL pgbnech 支持 长尾模型数据生成 - 离散幂律概率分布 - random_zipfian

digoal 2021-01-05
701

作者

digoal

日期

2021-05-19

标签

PostgreSQL , random_zipfian , pgbench , 长尾模型 , 离散幂律概率分布


背景

https://blog.csdn.net/u010945683/article/details/78843655

最近在看Multi-View Learning of Word Embeddings via CCA这篇文章时,里面提到了一个陌生的概率分布模型.

Zipfian distribution

一个离散幂律概率分布,也就是常常提到的长尾模型.在自然语言的语料库里,一个单词出现的频率与它在频率表里的排名成反比。频率最高的单词出现的频率大约是出现频率第二位的单词的2倍,是出现频率第三位的单词的3倍。
在Brown语料库中出现次数最多的词”the”占语料库中所有词出现次数的7%,排名第二的是”of”占3.5%.

pgbench 支持zipfian生成算法: 用于长尾模型.

https://www.postgresql.org/docs/devel/pgbench.html

```
公平随机分布
random ( lb, ub ) → integer
Computes a uniformly-distributed random integer in [lb, ub].
random(1, 10) → an integer between 1 and 10

指数分布
random_exponential ( lb, ub, parameter ) → integer
Computes an exponentially-distributed random integer in [lb, ub], see below.
random_exponential(1, 10, 3.0) → an integer between 1 and 10

高斯分布
random_gaussian ( lb, ub, parameter ) → integer
Computes a Gaussian-distributed random integer in [lb, ub], see below.
random_gaussian(1, 10, 2.5) → an integer between 1 and 10

离散幂律概率分布
random_zipfian ( lb, ub, parameter ) → integer
Computes a Zipfian-distributed random integer in [lb, ub], see below.
random_zipfian(1, 10, 1.5) → an integer between 1 and 10
```

```
random_zipfian generates a bounded Zipfian distribution.

parameter defines how skewed the distribution is.
The larger the parameter, the more frequently values closer to the beginning of the interval are drawn.
The distribution is such that, assuming the range starts from 1,
the ratio of the probability of drawing k versus drawing k+1 is

((k+1)/k)**parameter

For example,
random_zipfian(1, ..., 2.5) produces the value:
1 about (2/1)2.5 = 5.66 times more frequently than 2,
which 2 is produced (3/2)
2.5 = 2.76 times more frequently than 3,
and so on.
```

1出现的概率是2出现概率的5.66倍
2出现的概率是3出现概率的2.76倍

pgbench压测例子

``` vi t.sql \set size 10000000 \set r random_zipfian(1, :size, 2) \set k permute(:r, :size) \startpipeline insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); insert into tbl values (:r, :k); \endpipeline

pgbench -M prepared -n -r -P 1 -f ./t.sql -c 8 -j 8 -T 120

postgres=# create table tbl(c1 int, c2 int);

postgres=# select count(distinct c1), count(distinct c2) from tbl; count | count -------+------- 1096 | 2097 (1 row)

c1被映射到随机位置

postgres=# select c1,c2,count(*) from tbl group by 1,2 order by 1; c1 | c2 | count
--------+---------+-------- 1 | 896298 | 39019 1 | 5713084 | 67095 1 | 6611514 | 89746 1 | 8536650 | 206740 2 | 521883 | 51963 2 | 4019018 | 16880 2 | 8021937 | 22673 2 | 9983772 | 9706 3 | 6304463 | 22955 3 | 7681121 | 10035 3 | 7921875 | 4412 3 | 8685496 | 7477 4 | 6661420 | 13056 4 | 7145966 | 2393 4 | 7175403 | 4168 4 | 7660974 | 5634 ... ... 47915 | 1488672 | 1 48927 | 2723112 | 1 77704 | 7517990 | 1 80241 | 8421942 | 1 90213 | 1020744 | 1 584837 | 1755648 | 1 975187 | 763529 | 1 (2097 rows) ```

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论