PostgreSQL查询—单表查询的成本估算

原创学习笔记 2023-06-09

1571

PostgreSQL查询—单表查询的成本估算

PG的查询优化是基于成本的。成本是一个无量纲的值，他不是一种绝对的性能指标，但可以作为比较各种操作成本时的相对性能指标。

costsize.c中函数用于估算各种操作的成本。所有被执行器执行的操作都有着相应的成本函数。例如，函数cost_seqscan()和cost_index()分别用于估算顺序扫描和索引扫描的成本。

在PG中有三种成本，分别是启动成本、运行成本和总成本。总成本是穷成本和运行成本的和，因此只有启动成本和运行成本是单独估算的。

1、启动成本：在读取到第一条元组前花费的成本，比如索引扫描节点的启动成本就是读取目标表的索引页，获取到第一个元组的成本。

2、运行成本：获取全部元组的成本。

3、总成本：前两者之和。

EXPLAIN命令显示了每个操作的启动成本和总成本，下面是一个简单的例子：

1、postgres=# explain select * from tb1;
2、                       QUERY PLAN                        
3、---------------------------------------------------------
4、 Seq Scan on tb1  (cost=0.00..145.00 rows=10000 width=8)
5、(1 row)

postgres=#

第4行显示了顺序扫描的相关信息。成本部分包含了0.00和145.00两个值。在本例中，启动成本和总成本分别是0.00和145.00。

这里我们将详细介绍顺序扫描，索引扫描和排序操作的成本是如何估算的。

在接下来的内容中，我们使用下面这个表及其索引作为例子。

postgres=# CREATE TABLE tb1 (id int PRIMARY KEY, data int);
postgres=# CREATE INDEX tb1_data_idx ON tb1 (data);
postgres=# INSERT INTO tb1 SELECT generate_series(1,10000),generate_series(1,10000);
CREATE TABLE
postgres=# CREATE INDEX tb1_data_idx ON tb1 (data);
CREATE INDEX
postgres=# INSERT INTO tb1 SELECT generate_series(1,10000),generate_series(1,10000);
INSERT 0 10000
postgres=# ANALYZE;
ANALYZE
postgres=# \d tb1
                Table "public.tb1"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 data   | integer |           |          | 
Indexes:
    "tb1_pkey" PRIMARY KEY, btree (id)
    "tb1_data_idx" btree (data)

postgres=#

顺序扫描

顺序扫描的成本是通过函数cost_seqscan(()计算的。接下来我们一起研究顺序扫描成本是如何计算的，以下面的查询为例：

select * from tb1 where id<8000;

在顺序扫描中，启动成本等于0，而运行成本由以下公式定义：

run_cost = cpu_run_cost + disk_run_cost
		 = (cpu_tuple_cost + cpu_operator_cost) * Ntuple + seq_page_cost * Npage

其中seq_page_cost、cpu_tuple_cost和cpu_operator_cost是在postgresql.conf中配置的参数，默认值分别为1.0、0.01、0.0025。

postgres=# select name,setting from pg_settings where name in ('seq_page_cost','cpu_tuple_cost','cpu_operator_cost'); 
       name        | setting 
-------------------+---------
 cpu_operator_cost | 0.0025
 cpu_tuple_cost    | 0.01
 seq_page_cost     | 1
(3 rows)

postgres=#

Ntuple和Npage分别是表中的元组总数和页面总数，这两个值可以使用以下查询获取，Ntuple=10000，Npage=45。

postgres=# select relpages,reltuples from pg_class where relname='tb1';
 relpages | reltuples 
----------+-----------
       45 |     10000
(1 row)

postgres=#

因此：

run_cost = cpu_run_cost + disk_run_cost
		 = (cpu_tuple_cost + cpu_operator_cost) * Ntuple + seq_page_cost * Npage
		 = (0.01 + 0.0025) * 10000 + 1.0 * 45
		 = 125.00 + 45.00 = 170.00

最终：

total_cost = 0.00 + 170.00 = 170.00

作为验证，下面是该查询的EXPLAIN结果：

1、postgres=# explain select * from tb1 where id<5000;
2、                       QUERY PLAN                       
3、--------------------------------------------------------
4、 Seq Scan on tb1  (cost=0.00..170.00 rows=4999 width=8)
5、   Filter: (id < 5000)
6、(2 rows)

postgres=#

在第4行中可以看到，启动成本和总成本分别是0.00和170.00，且预计全表扫描返回行数为4999条（元组）。

第5行显示了一个顺序扫描的过滤器Filter: (id < 5000)。更精确的说，它是一个表级过滤谓词。注意，这种类型的过滤器只会在读取所有元组的时候使用，他并不会减少需要扫描的表页面数量。

从优化运行成本的角度来看，PostgreSQL假设所有的物理页都是从存储介质中获取的，即PostgreSQL不会考虑扫描的页面是否来自共享缓冲区。

索引扫描

尽管PostgreSQL支持很多索引方法，比如B树、GiST、GIN和BRIN，但是索引扫描的成本计算是使用一个共同的成本函数：cost_index()。

下面将研究索引扫描的成本是如何计算的，以下列查询为例。

select id,data from tb1 where data<240;

在计算该查询的成本之前，下面的查询能获取Nindex-page和Nindex-tuple的值：

postgres=# select relpages,reltuples from pg_class where relname='tb1_data_idx';
 relpages | reltuples 
----------+-----------
       30 |     10000
(1 row)

postgres=#

得出：Nindex-page = 30，Nindex-tuple=10000

启动成本

索引扫描的启动成本就是读取索引页以访问目标表的第一条元组的成本，由下面的公式定义：

start-up_cost = {ceil (log2 (Nindx-tuple)) + (Hindex+1) x 50} x cpu_operator_cost

其中Hindex是索引树的高度。

postgres=# select * from bt_metap('tb1_data_idx');
 magic  | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage 
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+---------------
 340322 |       4 |    3 |     1 |        3 |         1 |                         0 |                      -1 | t
(1 row)

postgres=#

Hindex=1，Nindex-tuple=10000，cpu_operator_cost=0.0025。

因此：

start-up_cost = {ceil (log2(10000)) + (1+1) x 50} x 0.0025
			  = {ceil (13.2877) + 100 } * 0.0025
			  = (14 + 100 )* 0.0025
			  = 114 * 0.0025 = 0.285

运行成本

索引扫描的运行成本是表和索引的cpu_cost和io_cost之和。

run_cost = (index_cpu_cost + table_cpu_cost) + (index_io_cost + table_io_cost)

如果使用仅索引扫描，则不会计算table_cpu_cost与table_io_cost。

前三个成本（即index_cpu_cost、table_cpu_cost和index_io_cost）如下所示：

index_cpu_cost = selectivity x Nindex-tuple x (cpu_index_tuple_cost+ qual_op_cost)
table_cpu_cost = selectivity x Ntuple x cpu_tuple_cost
index_io_cost = ceil(selectivity x Nindex-page) x random_page_cost

以上公式中高度cpu_index_tuple_cost和random_page_cost在postgresql.conf中配置，默认值分别为0.005和4.0。其中qual_op_costt粗略的讲是索引求值的成本（the evaluation cost of index），默认值0.0025，这里取自源码计算方式。选择率是一个0-1之间的浮点数，代表查询指定的where子句在索引中搜索范围的比例。例如，（selectivity x Ntuple）就是需要读取表元组数量，（selectivity x Nindex-tuple）就是需要读取索引元组数量，诸如此类。

其中选择率selectivity计算如下

postgres=# select histogram_bounds from pg_stats where tablename = 'tb1' and attname = 'data';

{1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,480
0,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,8200,8300,8400,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9
500,9600,9700,9800,9900,10000}
(1 row)

postgres=#

由于查询带有where子句data<240，而值239落在第二个桶中，通过线性插值推算出相应的选择率。因此查询中data列的选择率可以套用下面的公式计算：

selectivity = (2 + (239-hb[2]) / (hb[3]-hb[2])) / 100
			= (2 + (239-200) / (300-200)) / 100
			= (2 + 40 / 100) / 100
			= 0.0239

因此：

index_cpu_cost = selectivity x Nindex-tuple x (cpu_index_tuple_cost+ qual_op_cost)
			   = 0.0239 x 10000 x (0.005+0.0025)
			   = 1.8
table_cpu_cost = selectivity x Ntuple x cpu_tuple_cost
			   = 0.0239 x 10000 x 0.01
			   = 2.39
index_io_cost = ceil(selectivity x Nindex-page) x random_page_cost
			  = ceil(0.0239 x 30) x 4
			  = 1 x 4 
			  = 4

table_io_cost由下面的公式定义：

table_io_cost = max_io_cost + indexCorerelatition² x (min_io_cost - max_io_cost)

max_io_cost是最差情况下的I/O成本，即随机扫描所有数据页的成本。这个成本由以下公式定义：

max_io_cost = Npage x random_page_cost
			= 45 x 4.0
			= 180.0

min_io_cost是最优情况吓得I/O成本，即顺序扫描选定的数据页。这个成本由以下公式定义：

min_io_cost = 1 x random_page_cost + (ceil(selectivity x Npage) -1) x seq_page_cost
			= 1 x 4.0 + (ceil(0.0239 x 45) -1 x 1.0 
			= 4.0 + (2 - 1) x 1.0
			= 5.0

indexCorrelation—索引相关性

Index correlation is a statistical correlation between physical row ordering and logical ordering of the column values (cited from the official document). This ranges from −1−1 to +1+1.

索引相关性是列值在物理上的顺序和逻辑上的顺序的统计相关性。索引相关性的取值范围从-1到1。索引相关性是一种统计上的相关性。在索引扫描成本计算中，索引相关性体现了索引顺序和物理元组顺序扭曲程度给随机访问性能造成的影响大小。

这里indexCorrelation = 1.0

因此：

table_io_cost = max_io_cost + indexCorerelatition² x (min_io_cost - max_io_cost)
			  = 180.0 + 1² x (5.0 - 180.0)
			  = 5.0

最终：

run_cost = (index_cpu_cost + table_cpu_cost) + (index_io_cost + table_io_cost)
		 = (1.8 + 2.39) + (4.0 + 5.0)
		 = 13.19

整体成本

total_cost = start-up_cost + run_cost
		   = 0.285 + 13.19
		   = 13.475

作为确认，上述select查询的EXPLAIN结果如下所示：

1、postgres=# explain select * from tb1 where data<240;
2、                                QUERY PLAN                                 
3、---------------------------------------------------------------------------
4、 Index Scan using tb1_data_idx on tb1  (cost=0.29..13.47 rows=239 width=8)
5、   Index Cond: (data < 240)
6、(2 rows)

postgres=#

在第4行可以看到启动成本和总成本分别是0.29和13.47，预估有239条元组被扫描。

在第5行显示了一个索引条件Index Cond: (data < 240)。更准确的说，这个条件叫做谓词访问，它表达了索引扫描的开始条件与结束条件。

排序

排序路径会在排序操作中被使用。排序操作包括ORDER BY、归并连接的预处理操作，以及其他函数。函数cost_sort()用于计算排序操作的成本。

如果能在工作内存中犯下所有元组，那么排序操作会选用快速排序算法。否则就会创建临时文件，使用文件归并排序算法。

排序路径的启动成本就是对目标表的排序成本，因此成本就是O(Nsort) x log2(Nsort)，这里Nsort就是待排序的元组数。排序路径的运行成本就是读取已经排好序的元组的成本，因而成本就是O(Nsort)。

下面将研究以下查询排序成本的计算过程。假设该查询只使用工作内存，不使用临时文件。

select id,data from tb1 where data<240 order by id;

在本例中，启动成本由一下公式定义：

start-up_cost = C + comparison_cost x Nsortxlog2(Nsort)

这里的C就是上一次扫描的总成本，即上次索引扫描的总成本，为13.475，

Nsort=240，comparison_cost定义为 2 x cpu_operator_cost。因此：

start-up_cost = C + comparison_cost x Nsort x log2(Nsort)
			  = C + 2 x cpu_operator_cost x Nsort x log2(Nsort)
			  = 13.475 + (2 x 0.0025) x 240.0 x log2(240)
			  = 13.475 + 9.488
			  = 22.963

运行成本是在内存中读取排好序的元组的成本，即：

run_cost = cpu_operator_cost x Nsort
		 = 0.0025 x 240
		 = 0.6

最终：

total_cost = start-up_cost + run_cost
		   = 22.963 + 0.6
		   = 23.563

作为确认，以上select查询的EXPLAIN命令结果如下：

1、postgres=# explain select id,data from tb1 where data<240 order by id;
2、                                   QUERY PLAN                                    
3、---------------------------------------------------------------------------------
4、 Sort  (cost=22.91..23.51 rows=239 width=8)
5、   Sort Key: id
6、   ->  Index Scan using tb1_data_idx on tb1  (cost=0.29..13.47 rows=239 width=8)
7、         Index Cond: (data < 240)
8、(4 rows)

postgres=#

在第4行可以看到启动成本和运行成本分别为22.91和23.57。

墨力计划查询成本估算 postgresql

最后修改时间：2023-06-09 17:55:11

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

文章被以下合辑收录

PostgreSQL分享—李先生（共53篇）

日常的PostgreSQL分享，包含运维、升级、安装部署、疑难问题等

PostgreSQL查询—单表查询的成本估算

PostgreSQL查询—单表查询的成本估算

顺序扫描

索引扫描

启动成本

运行成本

整体成本

排序

文章被以下合辑收录

评论