
Hi everyone! This time 大表哥 is sharing PG's statistics and extended statistics with you.
Of course, the statistics we discuss today are those on tables and indexes: the PG query planner generates execution plans based on these statistics together with the cost constant parameters.
The planner cost constants act like the weighting factors in a cost calculation; for details see the official documentation: https://www.postgresql.org/docs/14/runtime-config-query.html
Common ones include: seq_page_cost, cpu_tuple_cost, cpu_index_tuple_cost …
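As a quick sanity check, the current values of these constants can be read straight from pg_settings (a minimal example; adjust the name list to taste):

SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost',
               'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');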
Table and index statistics, in turn, provide the underlying data on which execution plan generation relies.
Let's first look at the row estimate, also called cardinality:
The estimated row count is based on the reltuples attribute of pg_class:
dbtest@[local:/tmp]:1992=#60266 create table t1(id int, name varchar(200));
CREATE TABLE
dbtest@[local:/tmp]:1992=#60266 insert into t1 values (generate_series(1,1000),'hello PG!');
INSERT 0 1000
dbtest@[local:/tmp]:1992=#60266 select reltuples from pg_class where relname = 't1';
reltuples
-----------
1000
dbtest@[local:/tmp]:1992=#60266 explain analyze select * from t1;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..16.00 rows=1000 width=14) (actual time=0.016..0.207 rows=1000 loops=1)
Planning Time: 0.214 ms
Execution Time: 0.385 ms
(3 rows)
Let's see how the cardinality is estimated once the SQL statement carries a WHERE filter:
dbtest@[local:/tmp]:1992=#92372 explain analyze select * from t1 where id <= 505;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..18.50 rows=505 width=14) (actual time=0.021..0.139 rows=505 loops=1)
Filter: (id <= 505)
Rows Removed by Filter: 495
Planning Time: 0.344 ms
Execution Time: 0.233 ms
(5 rows)
Let's look at the histogram of column id: 100 buckets in total, each covering a range of about 10.
dbtest@[local:/tmp]:1992=#60266 SELECT histogram_bounds FROM pg_stats
dbtest-# WHERE tablename='t1' AND attname='id';
                  histogram_bounds
---------------------------------------------------
 {1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590,600,610,620,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890,900,910,920,930,940,950,960,970,980,990,1000}
Let's apply the selectivity formula given in the official documentation. The value in id <= 505 falls into bucket number 51, whose range is [500, 510]. Since this is an equi-height (height-balanced) histogram, the formula is as follows, where N is the number of the bucket containing the constant:
selectivity = ((N - 1) + (value - bucket[N].min) / (bucket[N].max - bucket[N].min)) / num_buckets
selectivity = (50 + (505 - 500) / (510 - 500)) / 100 = 0.505
So the cardinality after filtering is:
reltuples * selectivity = 1000 * 0.505 = 505, which matches rows=505 in the execution plan above.
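If you want to reproduce this interpolation straight from pg_stats rather than by hand, a query along these lines works (a sketch: it casts the anyarray histogram_bounds to int[], locates the bucket containing 505, then applies the same formula):

WITH h AS (
  SELECT histogram_bounds::text::int[] AS b
  FROM pg_stats
  WHERE tablename = 't1' AND attname = 'id'
)
SELECT ((i - 1) + (505 - b[i])::numeric / (b[i+1] - b[i]))
       / (array_length(b, 1) - 1) AS selectivity
FROM h, generate_subscripts(b, 1) AS i
WHERE i < array_length(b, 1)
  AND 505 >= b[i] AND 505 < b[i+1];
-- should return 0.505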
Now let's look at the equality case:
dbtest@[local:/tmp]:1992=#60266 update t1 set name = 'Great PG!' where id <= 10;
UPDATE 10
dbtest@[local:/tmp]:1992=#60266 select count(1), name from t1 group by name;
count | name
-------+-----------
10 | Great PG!
990 | hello PG!
(2 rows)
dbtest@[local:/tmp]:1992=#60266 analyze t1;
ANALYZE
Let's check the cardinality in the execution plans for where name = 'Great PG!' and where name = 'hello PG!' respectively:
name = 'Great PG!' : estimated rows=10
name = 'hello PG!' : estimated rows=990
dbtest@[local:/tmp]:1992=#60266 explain analyze select * from t1 where name = 'Great PG!';
QUERY PLAN
------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..18.50 rows=10 width=14) (actual time=0.090..0.091 rows=10 loops=1)
Filter: ((name)::text = 'Great PG!'::text)
Rows Removed by Filter: 990
Planning Time: 0.084 ms
Execution Time: 0.104 ms
(5 rows)
dbtest@[local:/tmp]:1992=#60266 explain analyze select * from t1 where name = 'hello PG!';
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..18.50 rows=990 width=14) (actual time=0.018..0.194 rows=990 loops=1)
Filter: ((name)::text = 'hello PG!'::text)
Rows Removed by Filter: 10
Planning Time: 0.061 ms
Execution Time: 0.258 ms
(5 rows)
Now let's look at the value distribution of the name column. Because a couple of values occur with very high frequency, they are captured in the most-common-values (MCV) list rather than in a histogram:
n_distinct: 2 — only 2 distinct values
most_common_vals: {"hello PG!","Great PG!"}
most_common_freqs: {0.99,0.01}
Since the MCV list already covers every value in the column, PG does not build an equi-height histogram here, and histogram_bounds is NULL.
name = 'Great PG!' : estimated rows = reltuples * most_common_freq = 1000 * 0.01 = 10
name = 'hello PG!' : estimated rows = reltuples * most_common_freq = 1000 * 0.99 = 990
dbtest@[local:/tmp]:1992=#60266 SELECT * FROM pg_stats
WHERE tablename='t1' AND attname='name';
-[ RECORD 1 ]----------+--------------------------
schemaname | public
tablename | t1
attname | name
inherited | f
null_frac | 0
avg_width | 10
n_distinct | 2
most_common_vals | {"hello PG!","Great PG!"}
most_common_freqs | {0.99,0.01}
histogram_bounds |
correlation | 0.9405999
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
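The same MCV lookup can be scripted against pg_stats (a sketch; it assumes the table name is unique across schemas, and uses array_position to find the value's slot in most_common_vals so the matching most_common_freqs entry can be multiplied by reltuples):

SELECT c.reltuples * s.most_common_freqs[
         array_position(s.most_common_vals::text::text[], 'Great PG!')
       ] AS estimated_rows
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 't1' AND s.attname = 'name';
-- should return 10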
Next, let's look at the cardinality estimate when the WHERE clause combines two different columns:
select * from t1 where name = 'hello PG!' and id <= 505;
How is the estimate of rows=500 derived from the statistics? It is simple probability: treating the columns as independent, the planner multiplies the two selectivities:
selectivity = selectivity(name = 'hello PG!') * selectivity(id <= 505)
= 0.99 * (50 + (505 - 500) / (510 - 500)) / 100
= 0.99 * 0.505
= 0.49995
rows = 1000 * 0.49995
= 499.95
≈ 500 (after rounding)
dbtest@[local:/tmp]:1992=#92372 explain analyze select * from t1 where name = 'hello PG!' and id <= 505;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..21.00 rows=500 width=14) (actual time=0.016..0.103 rows=495 loops=1)
Filter: ((id <= 505) AND ((name)::text = 'hello PG!'::text))
Rows Removed by Filter: 505
Planning Time: 0.078 ms
Execution Time: 0.129 ms
(5 rows)
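The arithmetic checks out in one line, with the planner rounding the product to the nearest integer:

SELECT round(1000 * (0.99 * 0.505)) AS estimated_rows;  -- 500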
Next, let's look at the case of joining two tables.
Prepare two tables, tt1 and tt2, in a one-to-many relationship (each id in tt1 appears twice in tt2):
dbtest@[local:/tmp]:1992=#92372 insert into tt1 values (generate_series(1,100),'inner join tt1');
INSERT 0 100
dbtest@[local:/tmp]:1992=#92372 insert into tt2 values (generate_series(1,100),'inner join tt2');
INSERT 0 100
dbtest@[local:/tmp]:1992=#92372 insert into tt2 values (generate_series(1,100),'inner join tt2');
INSERT 0 100
dbtest@[local:/tmp]:1992=#92372 select count(1) from tt1;
count
-------
100
(1 row)
dbtest@[local:/tmp]:1992=#92372 select count(1) from tt2;
count
-------
200
(1 row)
dbtest@[local:/tmp]:1992=#92372 analyze tt1;
ANALYZE
dbtest@[local:/tmp]:1992=#92372 analyze tt2;
ANALYZE
Let's look at the execution plan of the SQL statement select * from tt1, tt2 where tt1.id = tt2.id and tt1.name = 'inner join tt1';
dbtest@[local:/tmp]:1992=#92372 explain analyze select * from tt1 , tt2 where tt1.id = tt2.id and tt1.name = 'inner join tt1';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Hash Join (cost=3.50..10.25 rows=200 width=38) (actual time=0.053..0.117 rows=200 loops=1)
Hash Cond: (tt2.id = tt1.id)
-> Seq Scan on tt2 (cost=0.00..4.00 rows=200 width=19) (actual time=0.013..0.033 rows=200 loops=1)
-> Hash (cost=2.25..2.25 rows=100 width=19) (actual time=0.032..0.033 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Seq Scan on tt1 (cost=0.00..2.25 rows=100 width=19) (actual time=0.008..0.021 rows=100 loops=1)
Filter: ((name)::text = 'inner join tt1'::text)
Planning Time: 0.104 ms
Execution Time: 0.152 ms
(9 rows)
Let's query null_frac (the fraction of NULLs) and n_distinct (the number of distinct values, NDV) for tt1 and tt2:
When n_distinct is -1, every row has a distinct value, so num_distinct = reltuples = 100.
When n_distinct is -0.5, the distinct count is 50% of the row count, so num_distinct = reltuples * 0.5 = 200 * 0.5 = 100.
(A positive n_distinct is the absolute distinct count; a negative value expresses the distinct count as a fraction of reltuples, so -1 means num_distinct = reltuples.)
dbtest@[local:/tmp]:1992=#92372 SELECT tablename, null_frac,n_distinct FROM pg_stats WHERE tablename IN ('tt1', 'tt2') AND attname='id';
tablename | null_frac | n_distinct
-----------+-----------+------------
tt1 | 0 | -1
tt2 | 0 | -0.5
(2 rows)
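To turn n_distinct into an absolute distinct count, the sign convention above can be expressed as a CASE (a sketch; joining on relname assumes table names are unique):

SELECT s.tablename, s.n_distinct,
       CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
            ELSE -s.n_distinct * c.reltuples
       END AS num_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename IN ('tt1', 'tt2') AND s.attname = 'id';
-- both rows should show num_distinct = 100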
The formula is:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
= (1 - 0) * (1 - 0) * min(1/(1 * 100), 1/(0.5 * 200))
= 0.01
rows = (outer_cardinality * inner_cardinality) * selectivity
= (200 * 100) * 0.01
= 200
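Plugging the numbers in confirms the estimate:

SELECT (200 * 100) * ((1 - 0) * (1 - 0) * least(1 / (1 * 100.0), 1 / (0.5 * 200.0))) AS estimated_rows;  -- 200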
Next up is Extended Statistics, also known in PG as multivariate statistics, i.e. statistics across multiple columns. The feature was introduced in PG 10.
Simply put, if several columns of a table are correlated through the business logic, collecting statistics on them jointly gives the optimizer the accurate data it needs to produce the best execution plan.
Extended statistics are not collected automatically by the system; you create them manually when needed. They are stored mainly in pg_statistic_ext and pg_statistic_ext_data.
The command is:
CREATE STATISTICS stts (dependencies|ndistinct|mcv) ON col1, col2 FROM table;
Note the keywords dependencies, ndistinct and mcv, which describe how col1 and col2 depend on each other in the business domain or by common sense. Take the example from the official documentation: a table with zipcode and city columns. Common sense says a zip code determines its city,
so for a WHERE clause like city = 'TJ' and zipcode = '300100', dependency statistics give the planner an accurate row estimate instead of the underestimate it would get by naively multiplying the two individual selectivities.
Let's build a concrete example.
In the table tt_ext below, the values of col1 and col2 are in a strict one-to-one relationship:
1) when col1 = 'hello', col2 = 'PG'
2) when col1 = '你好', col2 = 'Postgres'
postgres@[local:/tmp]:1992=#46124 create table tt_ext (id int, col1 varchar(20), col2 varchar(20));
CREATE TABLE
postgres@[local:/tmp]:1992=#46124 insert into tt_ext values (generate_series(1,20),'hello','PG');
INSERT 0 20
postgres@[local:/tmp]:1992=#46124 insert into tt_ext values (generate_series(21,25),'你好','Postgres');
INSERT 0 5
postgres@[local:/tmp]:1992=#46124 analyze tt_ext;
ANALYZE
In the execution plan of select * from tt_ext where col1 = 'hello' and col2 = 'PG', the estimated rows = 16 is computed as:
selectivity = selectivity(col1 = 'hello') * selectivity(col2 = 'PG')
= 0.8*0.8
= 0.64
rows = reltuples * selectivity
= 25 * 0.64
= 16
postgres@[local:/tmp]:1992=#46124 SELECT * FROM pg_stats
postgres-# WHERE tablename='tt_ext' AND attname in ('col1','col2');
-[ RECORD 1 ]----------+--------------
schemaname | public
tablename | tt_ext
attname | col1
inherited | f
null_frac | 0
avg_width | 6
n_distinct | 2
most_common_vals | {hello,你好}
most_common_freqs | {0.8,0.2}
histogram_bounds |
correlation | 1
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
-[ RECORD 2 ]----------+--------------
schemaname | public
tablename | tt_ext
attname | col2
inherited | f
null_frac | 0
avg_width | 4
n_distinct | 2
most_common_vals | {PG,Postgres}
most_common_freqs | {0.8,0.2}
histogram_bounds |
correlation | 1
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
postgres@[local:/tmp]:1992=#46124 explain analyze select * from tt_ext where col1 = 'hello' and col2='PG';
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on tt_ext (cost=0.00..1.38 rows=16 width=14) (actual time=0.021..0.032 rows=20 loops=1)
Filter: (((col1)::text = 'hello'::text) AND ((col2)::text = 'PG'::text))
Rows Removed by Filter: 5
Planning Time: 0.098 ms
Execution Time: 0.062 ms
(5 rows)
Now let's create the multivariate statistics on col1 and col2 by hand. Note that after running the CREATE STATISTICS command you must run ANALYZE before the extended statistics are actually collected.
Extended statistics can be queried by joining pg_statistic_ext with pg_statistic_ext_data, or directly through the pg_stats_ext view.
stxname: the statistics name given in CREATE STATISTICS
stxkeys: attribute numbers 2 and 3, i.e. col1 and col2
stxddependencies: "2 => 3": 1.000000 means the value of col1 determines the value of col2 100% of the time;
"3 => 2": 1.000000 means the value of col2 determines the value of col1 100% of the time.
postgres@[local:/tmp]:1992=#46124 create statistics stats_ext_tt_col1_col2 (dependencies) on col1,col2 from tt_ext;
CREATE STATISTICS
postgres@[local:/tmp]:1992=#46124 analyze tt_ext;
ANALYZE
postgres@[local:/tmp]:1992=#46124 SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stats_ext_tt_col1_col2';
stxname | stxkeys | stxddependencies
------------------------+---------+------------------------------------------
stats_ext_tt_col1_col2 | 2 3 | {"2 => 3": 1.000000, "3 => 2": 1.000000}
(1 row)
postgres@[local:/tmp]:1992=#46124 SELECT * FROM pg_stats_ext WHERE tablename = 'tt_ext';
-[ RECORD 1 ]----------+-----------------------------------------
schemaname | public
tablename | tt_ext
statistics_schemaname | public
statistics_name | stats_ext_tt_col1_col2
statistics_owner | postgres
attnames | {col1,col2}
exprs |
kinds | {f}
inherited | f
n_distinct |
dependencies | {"2 => 3": 1.000000, "3 => 2": 1.000000}
most_common_vals |
most_common_val_nulls |
most_common_freqs |
most_common_base_freqs |
Now look at the plan again: the estimate rows=20 is exact.
Since col1 = 'hello' fully determines col2 = 'PG',
and conversely col2 = 'PG' fully determines col1 = 'hello',
the combined selectivity is simply that of either single column:
selectivity = selectivity(col1 = 'hello') = selectivity(col2 = 'PG')
= 0.8
rows = reltuples * selectivity
= 25 * 0.8
= 20
postgres@[local:/tmp]:1992=#46124 explain analyze select * from tt_ext where col1 = 'hello' and col2='PG';
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on tt_ext (cost=0.00..1.38 rows=20 width=14) (actual time=0.008..0.012 rows=20 loops=1)
Filter: (((col1)::text = 'hello'::text) AND ((col2)::text = 'PG'::text))
Rows Removed by Filter: 5
Planning Time: 0.160 ms
Execution Time: 0.026 ms
(5 rows)
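For the general case, the official documentation gives the dependency-aware formula P(a,b) = P(a) * (f + (1 - f) * P(b)), where f is the dependency degree. With f = 1.000000 the second factor collapses to 1, leaving just P(col1 = 'hello') = 0.8:

SELECT 25 * (0.8 * (1.0 + (1 - 1.0) * 0.8)) AS estimated_rows;  -- 20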
Next, let's look at Multivariate N-Distinct Counts: simply put, these statistics record the number of distinct values over a combination of columns.
We keep using the earlier table tt_ext:
postgres@[local:/tmp]:1992=#46124 select col1,col2 from tt_ext group by col1,col2;
col1 | col2
-------+----------
hello | PG
你好 | Postgres
(2 rows)
Create the Multivariate N-Distinct Counts statistics as follows.
n_distinct: {"2, 3": 2} means the column combination {col1, col2} has 2 distinct values.
postgres@[local:/tmp]:1992=#46124 create statistics stats_ext_tt_col1_col2_nvd (ndistinct) on col1,col2 from tt_ext;
CREATE STATISTICS
postgres@[local:/tmp]:1992=#46124 analyze tt_ext;
ANALYZE
postgres@[local:/tmp]:1992=#46124 select * from pg_stats_ext where tablename = 'tt_ext';
-[ RECORD 2 ]----------+-----------------------------------------
schemaname | public
tablename | tt_ext
statistics_schemaname | public
statistics_name | stats_ext_tt_col1_col2_nvd
statistics_owner | postgres
attnames | {col1,col2}
exprs |
kinds | {d}
inherited | f
n_distinct | {"2, 3": 2}
dependencies |
most_common_vals |
most_common_val_nulls |
most_common_freqs |
most_common_base_freqs |
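Where this pays off is GROUP BY estimation: without the ndistinct statistics the planner would multiply the per-column distinct counts (2 * 2 = 4); with them it should use the stored group count of 2. A quick way to check (output not shown here):

EXPLAIN SELECT col1, col2 FROM tt_ext GROUP BY col1, col2;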
Next, let's look at Multivariate MCV Lists: simply put, these statistics record the most common values over a combination of columns:
most_common_vals: {{hello,PG},{你好,Postgres}} — the two most common value combinations
most_common_val_nulls: {{f,f},{f,f}} — f means the value is not NULL
most_common_freqs: {0.8,0.2} — the observed frequency of each combination: {hello,PG} appears 80% of the time, {你好,Postgres} 20%
most_common_base_freqs — the frequency each combination would have if the columns were independent: 0.8 * 0.8 = 0.64 for {hello,PG} and 0.2 * 0.2 = 0.04 for {你好,Postgres}
postgres@[local:/tmp]:1992=#46124 create statistics stats_ext_tt_col1_col2_mcv (mcv) on col1,col2 from tt_ext;
CREATE STATISTICS
postgres@[local:/tmp]:1992=#46124 analyze tt_ext;
ANALYZE
postgres@[local:/tmp]:1992=#95670 select * from pg_stats_ext where tablename = 'tt_ext';
-[ RECORD 1 ]----------+-----------------------------------------
schemaname | public
tablename | tt_ext
statistics_schemaname | public
statistics_name | stats_ext_tt_col1_col2_mcv
statistics_owner | postgres
attnames | {col1,col2}
exprs |
kinds | {m}
inherited | f
n_distinct |
dependencies |
most_common_vals | {{hello,PG},{你好,Postgres}}
most_common_val_nulls | {{f,f},{f,f}}
most_common_freqs | {0.8,0.2}
most_common_base_freqs | {0.6400000000000001,0.04000000000000001}
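The raw MCV items can also be inspected with the pg_mcv_list_items() function, following the example in the official documentation (adapted to our statistics object name):

SELECT m.*
FROM pg_statistic_ext
JOIN pg_statistic_ext_data ON (oid = stxoid),
     pg_mcv_list_items(stxdmcv) m
WHERE stxname = 'stats_ext_tt_col1_col2_mcv';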
Finally, let's summarize PG table statistics:
1) Base statistics can be queried through pg_stats and pg_class.
2) Extended statistics (joint statistics over multiple columns) can be queried by joining pg_statistic_ext with pg_statistic_ext_data, or directly through pg_stats_ext. Three kinds are supported:
Dependencies, Multivariate N-Distinct Counts, Multivariate MCV Lists.
Have fun 🙂 !