前言
前阵子,④群有位群友提供了一个有趣的现象,今儿得空看了一下。本地也成功复现出来了,如下
postgres=# select * from pgstattuple('tab4');
-[ RECORD 1 ]------+-------
table_len | 688128
tuple_count | 9990
tuple_len | 609390
tuple_percent | 88.56
dead_tuple_count | 0 ---死元组为零
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 6416
free_percent | 0.93
postgres=# select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum from pg_stat_user_tables where relname = 'tab4';
-[ RECORD 1 ]---+------------------------------
relname | tab4
n_live_tup | 9990
n_dead_tup | 10 ---死元祖还有10个
last_vacuum |
last_autovacuum | 2024-06-20 17:58:41.244272+08
可以看到,pgstattuple (精确统计) 的结果显示 dead_tuple_count 为 0,即没有死元组,但是 pg_stat_user_tables 却显示有 10 条死元组。有童鞋可能会说是不是统计信息的问题,不然,pg_stat_user_tables 实际上是基于 pg_stat_all_tables 的视图,是"精确值",原理我已经写过很多篇文章了,此处不再过多赘述。那么问题出在哪了?分析一下。
复现
首先照着群友的步骤复现一下
create table tab4 (id int, name varchar(128)); insert into tab4 select id, md5(id::varchar) from generate_series(1,10000) as id; select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum from pg_stat_user_tables where relname = 'tab4'; select * from pgstattuple('tab4'); delete from tab4 where id <= 10; create index concurrently idx_name4 on tab4(name);
当执行完第六步之后,便会出现上方的现象:pg_stat_user_tables 和 pgstattuple 显式的结果不一致。
另外,在第三步,各位可能会发现 last_autovacuum 有值 (手速快的话可能看不到,超过了 autovacuum_naptime),不要奇怪,其实这是在 13 以后的新特性,在之前的版本,如果是纯 INSERT,是不会触发 vacuum 的,那么就有可能许久不触发 FREEZE,导致冷不丁来一个冻结炸弹,所以 13 之后提供了autovacuum_vacuum_insert_scale_factor 参数用于控制插入多少条触发 vacuum,当然这是主要目的,其次比如更新 pg_class.relallvisble,用于预估 index only scan 的成本等等。

回到正题,按照自动清理的触发逻辑,当表更新或者删除的元组数超过了 autovacuum_vacuum_threshold+ autovacuum_vacuum_scale_factor * pg_class.reltuples,很明显,这里才删除了 10 条,所以不会触发自动清理。因此关键就在于最后一步,做了一个 CIC。
当不清楚代码细节的情况下,又要用到老手段了,老样子,一股脑全部抓下来
[postgres@mypg ~]$ cat 16stp.out | sort | uniq | grep -i 'prune'
postgres: heap_page_prune
postgres: heap_page_prune_execute
postgres: heap_page_prune_opt
postgres: heap_prune_chain
postgres: heap_prune_record_dead
postgres: heap_prune_satisfies_vacuum
这下就很清楚了,又是我们熟知的页剪枝。heap_page_prune 用于删除页面中的过期死元组,遍历过程中,然后调用 heap_prune_chain 遍历 HOT 链,然后调用 heap_page_prune_execute 批量删除死元组。所以,如果你用 GDB 在 heap_prune_chain 打个断点的话,很容易就可以复现。看看注释就行了:
/*
* Prune specified line pointer or a HOT chain originating at line pointer.
*
* If the item is an index-referenced tuple (i.e. not a heap-only tuple),
* the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
* chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
* This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
* DEAD, our visibility test is just too coarse to detect it.
*
* In general, pruning must never leave behind a DEAD tuple that still has
* tuple storage. VACUUM isn't prepared to deal with that case. That's why
* VACUUM prunes the same heap page a second time (without dropping its lock
* in the interim) when it sees a newly DEAD tuple that we initially saw as
* in-progress. Retrying pruning like this can only happen when an inserting
* transaction concurrently aborts.
*
* The root line pointer is redirected to the tuple immediately after the
* latest DEAD tuple. If all tuples in the chain are DEAD, the root line
* pointer is marked LP_DEAD. (This includes the case of a DEAD simple
* tuple, which we treat as a chain of length 1.)
*
* We don't actually change the page here. We just add entries to the arrays in
* prstate showing the changes to be made. Items to be redirected are added
* to the redirected[] array (two entries per redirection); items to be set to
* LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
* state are added to nowunused[].
*
* Returns the number of tuples (to be) deleted from the page.
*/
为了确认我们的结论,使用 pageinspect 查看一下
postgres=# select lp, t_xmin, t_xmax, t_ctid,
infomask(t_infomask, 1) as infomask,
infomask(t_infomask2, 2) as infomask2
from heap_page_items(get_raw_page('tab4', 0)) limit 15;
lp | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+---------+--------+--------+-----------------------------------------+-----------
1 | | | | |
2 | | | | |
3 | | | | |
4 | | | | |
5 | | | | |
6 | | | | |
7 | | | | |
8 | | | | |
9 | | | | |
10 | | | | |
11 | 1616215 | 0 | (0,11) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH |
12 | 1616215 | 0 | (0,12) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH |
13 | 1616215 | 0 | (0,13) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH |
14 | 1616215 | 0 | (0,14) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH |
15 | 1616215 | 0 | (0,15) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH |
(15 rows)
果然,前面的 10 条死元组都已经被清理掉了。而页剪枝不会去更新 pg_stat_all_tables (pgstat_update_heap_dead_tuples),也不会刷新可见性映射和空闲空间映射 (为了更新留存),所以二者之前有差异就不足为奇了。
小结
页剪枝是个比较复杂的东西,包括 HOT 链剪枝、索引页剪枝等等,感兴趣的读者可以参照《PostgreSQL 14 internals》中的例子进行学习。




