死元组数量居然不一样？莫非是BUG？非也

原创 xiongcc 2024-06-21

329

前言

前阵子，④群有位群友提供了一个有趣的现象，今儿得空看了一下。本地也成功复现出来了，如下

postgres=# select * from pgstattuple('tab4');
-[ RECORD 1 ]------+-------
table_len          | 688128
tuple_count        | 9990
tuple_len          | 609390
tuple_percent      | 88.56
dead_tuple_count   | 0     ---死元组为零
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 6416
free_percent       | 0.93

postgres=# select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum from pg_stat_user_tables where relname = 'tab4';
-[ RECORD 1 ]---+------------------------------
relname         | tab4
n_live_tup      | 9990
n_dead_tup      | 10      ---死元祖还有10个
last_vacuum     | 
last_autovacuum | 2024-06-20 17:58:41.244272+08

可以看到，pgstattuple (精确统计) 的结果显示 dead_tuple_count 为 0，即没有死元组，但是 pg_stat_user_tables 却显示有 10 条死元组。有童鞋可能会说是不是统计信息的问题，不然，pg_stat_user_tables 实际上是基于 pg_stat_all_tables 的视图，是"精确值"，原理我已经写过很多篇文章了，此处不再过多赘述。那么问题出在哪了？分析一下。

复现

首先照着群友的步骤复现一下

create table tab4 (id int, name varchar(128));
insert into tab4 select id, md5(id::varchar) from generate_series(1,10000) as id;
select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum from pg_stat_user_tables where relname = 'tab4';
select * from pgstattuple('tab4');
delete from tab4 where id <= 10;
create index concurrently idx_name4 on tab4(name);

当执行完第六步之后，便会出现上方的现象：pg_stat_user_tables 和 pgstattuple 显式的结果不一致。

另外，在第三步，各位可能会发现 last_autovacuum 有值 (手速快的话可能看不到，超过了 autovacuum_naptime)，不要奇怪，其实这是在 13 以后的新特性，在之前的版本，如果是纯 INSERT，是不会触发 vacuum 的，那么就有可能许久不触发 FREEZE，导致冷不丁来一个冻结炸弹，所以 13 之后提供了autovacuum_vacuum_insert_scale_factor 参数用于控制插入多少条触发 vacuum，当然这是主要目的，其次比如更新 pg_class.relallvisble，用于预估 index only scan 的成本等等。

回到正题，按照自动清理的触发逻辑，当表更新或者删除的元组数超过了 autovacuum_vacuum_threshold+ autovacuum_vacuum_scale_factor * pg_class.reltuples，很明显，这里才删除了 10 条，所以不会触发自动清理。因此关键就在于最后一步，做了一个 CIC。

当不清楚代码细节的情况下，又要用到老手段了，老样子，一股脑全部抓下来

[postgres@mypg ~]$ cat 16stp.out | sort | uniq | grep -i 'prune'
postgres: heap_page_prune
postgres: heap_page_prune_execute
postgres: heap_page_prune_opt
postgres: heap_prune_chain
postgres: heap_prune_record_dead
postgres: heap_prune_satisfies_vacuum

这下就很清楚了，又是我们熟知的页剪枝。heap_page_prune 用于删除页面中的过期死元组，遍历过程中，然后调用 heap_prune_chain 遍历 HOT 链，然后调用 heap_page_prune_execute 批量删除死元组。所以，如果你用 GDB 在 heap_prune_chain 打个断点的话，很容易就可以复现。看看注释就行了：

/*
 * Prune specified line pointer or a HOT chain originating at line pointer.
 *
 * If the item is an index-referenced tuple (i.e. not a heap-only tuple),
 * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
 * chain.  We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
 * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
 * DEAD, our visibility test is just too coarse to detect it.
 *
 * In general, pruning must never leave behind a DEAD tuple that still has
 * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
 * VACUUM prunes the same heap page a second time (without dropping its lock
 * in the interim) when it sees a newly DEAD tuple that we initially saw as
 * in-progress.  Retrying pruning like this can only happen when an inserting
 * transaction concurrently aborts.
 *
 * The root line pointer is redirected to the tuple immediately after the
 * latest DEAD tuple.  If all tuples in the chain are DEAD, the root line
 * pointer is marked LP_DEAD.  (This includes the case of a DEAD simple
 * tuple, which we treat as a chain of length 1.)
 *
 * We don't actually change the page here. We just add entries to the arrays in
 * prstate showing the changes to be made.  Items to be redirected are added
 * to the redirected[] array (two entries per redirection); items to be set to
 * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
 * state are added to nowunused[].
 *
 * Returns the number of tuples (to be) deleted from the page.
 */

为了确认我们的结论，使用 pageinspect 查看一下

postgres=# select lp, t_xmin, t_xmax, t_ctid,
       infomask(t_infomask, 1) as infomask,
       infomask(t_infomask2, 2) as infomask2
from heap_page_items(get_raw_page('tab4', 0)) limit 15;
 lp | t_xmin  | t_xmax | t_ctid |                infomask                 | infomask2 
----+---------+--------+--------+-----------------------------------------+-----------
  1 |         |        |        |                                         | 
  2 |         |        |        |                                         | 
  3 |         |        |        |                                         | 
  4 |         |        |        |                                         | 
  5 |         |        |        |                                         | 
  6 |         |        |        |                                         | 
  7 |         |        |        |                                         | 
  8 |         |        |        |                                         | 
  9 |         |        |        |                                         | 
 10 |         |        |        |                                         | 
 11 | 1616215 |      0 | (0,11) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH | 
 12 | 1616215 |      0 | (0,12) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH | 
 13 | 1616215 |      0 | (0,13) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH | 
 14 | 1616215 |      0 | (0,14) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH | 
 15 | 1616215 |      0 | (0,15) | XMAX_INVALID|XMIN_COMMITTED|HASVARWIDTH | 
(15 rows)

果然，前面的 10 条死元组都已经被清理掉了。而页剪枝不会去更新 pg_stat_all_tables (pgstat_update_heap_dead_tuples)，也不会刷新可见性映射和空闲空间映射 (为了更新留存)，所以二者之前有差异就不足为奇了。

小结

页剪枝是个比较复杂的东西，包括 HOT 链剪枝、索引页剪枝等等，感兴趣的读者可以参照《PostgreSQL 14 internals》中的例子进行学习。

postgresql

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

死元组数量居然不一样？莫非是BUG？非也

前言

复现

小结

评论