PG vs MySQL 统计信息收集的异同

原创进击的CJR 云和恩墨技术服务团队 2025-02-05

354

统计信息的作用

对于一条SQL，数据库选择何种方式执行，需要根据统计信息进行估算，计算出代价最低的执行计划。收集统计信息主要是为了让优化器做出正确的判断，选择最佳的执行计划。

PG的统计信息收集

PG的统计信息相关表

在PostgreSQL里面，统计信息存放于pg_statistics系统表中，由于pg_statistics里面的内容人为不易阅读，因此便有了pg_stats视图。

pg_class看pages和tuples

postgres=# select relname,relpages,reltuples::bigint from pg_class where relname='test'\gx
-[ RECORD 1 ]-----
relname   | test
relpages  | 443
reltuples | 100000

pg_stat_all_tables看活元组、死元组，上次统计信息收集时间

postgres=# select * from pg_stat_all_tables where relname='test'\gx
-[ RECORD 1 ]-------+------------------------------
relid               | 16388
schemaname          | public
relname             | test
seq_scan            | 0
last_seq_scan       | 
seq_tup_read        | 0
idx_scan            | 
last_idx_scan       | 
idx_tup_fetch       | 
n_tup_ins           | 100000
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_tup_newpage_upd   | 0
n_live_tup          | 100000
n_dead_tup          | 0
n_mod_since_analyze | 0
n_ins_since_vacuum  | 0
last_vacuum         | 
last_autovacuum     | 2025-01-21 10:46:51.330118+08
last_analyze        | 
last_autoanalyze    | 2025-01-21 10:46:51.353753+08
vacuum_count        | 0
autovacuum_count    | 1
analyze_count       | 0
autoanalyze_count   | 1

pg_stats看列的统计信息

\d pg_stats
                     View "pg_catalog.pg_stats"
         Column         |   Type   | Collation | Nullable | Default 
------------------------+----------+-----------+----------+---------
 schemaname             | name     |           |          | 
 tablename              | name     |           |          | 
 attname                | name     |           |          | 
 inherited              | boolean  |           |          | ---是否是继承列
 null_frac              | real     |           |          | ---null空值的比率
 avg_width              | integer  |           |          | ---平均宽度，字节
 n_distinct             | real     |           |          | ---大于零就是非重复值的数量，小于零则是非重复值的个数除以行数
 most_common_vals       | anyarray |           |          | ---高频值
 most_common_freqs      | real[]   |           |          | ---高频值的频率
 histogram_bounds       | anyarray |           |          | ---直方图
 correlation            | real     |           |          | ---物理顺序和逻辑顺序的关联性
 most_common_elems      | anyarray |           |          | ---高频元素，比如数组
 most_common_elem_freqs | real[]   |           |          |  ---高频元素的频率
 elem_count_histogram   | real[]   |           |          |  ---直方图（元素）

PG自动收集统计信息

• 触发vacuum analyze–>
• 表上新增(insert,update,delte) >= autovacuum_analyze_scale_factor* reltuples(表上记录数) + autovacuum_analyze_threshold

postgres=# show autovacuum_analyze_scale_factor;
 autovacuum_analyze_scale_factor 
---------------------------------
 0.1
(1 row)

postgres=# show autovacuum_analyze_threshold;
 autovacuum_analyze_threshold 
------------------------------
 50
(1 row)

PG手动收集统计信息

手动收集统计信息的命令是analyze命令，analyze的语法格式：

analyze [verbose] [table[(column[,…])]]

verbose：显示处理的进度，以及表的一些统计信息

table：要分析的表名，如果不指定，则对整个数据库中的所有表作分析 

column：要分析的特定字段的名字默认是分析所有字段

analyze 命令 会在表上加读锁，不影响表上其它SQL并发执行，对于大表只会读取表中部分数据。

MySQL的统计信息收集

MySQL的统计信息相关表

• 收集的表的统计信息存放在mysql数据库的innodb_table_stats表中。
• 索引的统计信息存放在mysql数据库的innodb_index_stats表中。

mysql>  select * from mysql.innodb_table_stats where table_name='actor';
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| database_name | table_name | last_update         | n_rows | clustered_index_size | sum_of_other_index_sizes |
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| sakila        | actor      | 2025-01-21 16:06:31 |    200 |                    1 |                        1 |
+---------------+------------+---------------------+--------+----------------------+--------------------------+
1 row in set (0.00 sec)

mysql>  select * from mysql.innodb_index_stats where table_name='actor'; 
+---------------+------------+---------------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name          | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+---------------------+---------------------+--------------+------------+-------------+-----------------------------------+
| sakila        | actor      | PRIMARY             | 2025-01-21 16:06:31 | n_diff_pfx01 |        200 |           1 | actor_id                          |
| sakila        | actor      | PRIMARY             | 2025-01-21 16:06:31 | n_leaf_pages |          1 |        NULL | Number of leaf pages in the index |
| sakila        | actor      | PRIMARY             | 2025-01-21 16:06:31 | size         |          1 |        NULL | Number of pages in the index      |
| sakila        | actor      | idx_actor_last_name | 2025-01-21 16:06:31 | n_diff_pfx01 |        121 |           1 | last_name                         |
| sakila        | actor      | idx_actor_last_name | 2025-01-21 16:06:31 | n_diff_pfx02 |        200 |           1 | last_name,actor_id                |
| sakila        | actor      | idx_actor_last_name | 2025-01-21 16:06:31 | n_leaf_pages |          1 |        NULL | Number of leaf pages in the index |
| sakila        | actor      | idx_actor_last_name | 2025-01-21 16:06:31 | size         |          1 |        NULL | Number of pages in the index      |
+---------------+------------+---------------------+---------------------+--------------+------------+-------------+-----------------------------------+
7 rows in set (0.00 sec)

MySQL自动收集统计信息

• innodb_stats_persistent
是否把统计信息持久化。
对应表选项STATS_PERSISTENT

• innodb_stats_auto_recalc
当一个表的数据变化超过10%时是否自动收集统计信息，两次统计信息收集之间时间间隔不能少10秒。
对应的表选项STATS_AUTO_RECALC

• innodb_stats_on_metadata：其触发条件是表的元数据发生变化，如执行 ALTER TABLE 等操作修改表结构时，会触发统计信息的自动更新。

• innodb_stats_persistent_sample_pages
统计索引时的抽样页数，这个值设置得越大，收集的统计信息越准确，但收集时消耗的资源越大。
对应的表选项STATS_SAMPLE_PAGES

mysql> show variables like 'innodb_stat%';
+--------------------------------------+-------------+
| Variable_name                        | Value       |
+--------------------------------------+-------------+
| innodb_stats_auto_recalc             | ON          |
| innodb_stats_include_delete_marked   | OFF         |
| innodb_stats_method                  | nulls_equal |
| innodb_stats_on_metadata             | OFF         |
| innodb_stats_persistent              | ON          |
| innodb_stats_persistent_sample_pages | 20          |
| innodb_stats_transient_sample_pages  | 8           |
| innodb_status_output                 | OFF         |
| innodb_status_output_locks           | OFF         |
+--------------------------------------+-------------+
9 rows in set (0.00 sec)

对应的表选项可以这样设置
alter table actor stats_auto_recalc=0;

MySQL手动收集统计信息

analyze local table actor,rental;

analyze table 加MDL读锁，不影响DML的并行操作。

PG vs MySQL

在自动收集统计信息的方法上，PG比MySQL更加灵活，例如在表统计信息更新触发条件上， PG可以通过调整autovacuum_analyze_scale_factor的大小，来调整更新触发条件的数据量比例，而MySQL只能是10%，而且，因为PG还有autovacuum_analyze_threshold这个最小更新量保护机制，避免小表被频发触发统计信息收集影响性能。

在手动收集统计信息的方式上，PG和MySQL类似，都会加上读锁，MySQL加元数据读锁，不影响DML并行，PG加共享更新独占（SHARE UPDATE EXCLUSIVE），也不影响DML并行。

另外PG统计信息收集还有两个优势

统计信息的精度
MySQL统计信息的精度相对较低，尤其是在数据量较大且分布不均匀的情况下，可能无法准确地反映数据的实际情况，从而影响查询优化器的选择；而PG除了包含与 MySQL 类似的基本统计信息外，还提供了更丰富的统计内容，如多字段统计信息，可以对多个列的组合进行统计分析，为复杂的查询提供更精确的优化依据。

对性能的影响
MySQL自动收集统计信息可能会在一定程度上增加系统的负载，尤其是在数据量较大且修改频繁的情况下；而PostgreSQL的autovacuum 进程在后台自动运行，对系统性能的影响相对较小。但在进行大规模的数据操作或系统负载较高时，可能会导致一定的性能波动。

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者