PolarDB PostgreSQL vacuum数据清理

PolarDB 2025-03-24

274

PolarDB PostgreSQL vacuum数据清理

关于 PolarDB PostgreSQL 版

PolarDB PostgreSQL 版是一款阿里云自主研发的云原生关系型数据库产品，100% 兼容 PostgreSQL，高度兼容Oracle语法；采用基于 Shared-Storage 的存储计算分离架构，具有极致弹性、毫秒级延迟、HTAP 、Ganos全空间数据处理能力和高可靠、高可用、弹性扩展等企业级数据库特性。同时，PolarDB PostgreSQL 版具有大规模并行计算能力，可以应对 OLTP 与 OLAP 混合负载。

VACUUM

PostgreSQL通过MVCC多版本实现并发控制，每个写操作都会创建一个新版本的数据项，并保留其旧版本。当事务读取数据对象时，系统会选择基于该事务快照可见的一个版本，通过这种方式来确保各个事务间的相互隔离。当旧版本数据项对系统中的所有事务均不可见时，vacuum机制会对这些死元组进行清理，从而避免数据膨胀。对特定表的自动清理由autovacuum worker进行，清理过程主要包括：

移除死元组：移除每个page中的dead tuple及指向dead tuple的index tuple；
freeze旧的事务标识：freeze旧元组的事务标识，用于防止xid回卷；
更新表的空闲空间映射FSM和可见性映射VM，判断是否需要truncate heap最后的空页，根据freeze后得到的datfrozenxid删除不再需要的clog等事务文件。

以下对第一部分清理死元组的过程进行详细介绍。

Page结构

PostgreSQL中每个表对应page的结构如下所示，由PageHeaderData + line pointer + tuple + special space四部分组成

/*
 * A postgres disk page is an abstraction layered on top of a postgres
 * disk block (which is simply a unit of i/o, see block.h).
 *
 * specifically, while a disk block can be unformatted, a postgres
 * disk page is always a slotted page of the form:
 *
 * +----------------+---------------------------------+
 * | PageHeaderData | linp1 linp2 linp3 ...           |
 * +-----------+----+---------------------------------+
 * | ... linpN |  |
 * +-----------+--------------------------------------+
 * |   ^ pd_lower  |
 * |  |
 * | v pd_upper  |
 * +-------------+------------------------------------+
 * | | tupleN ...                         |
 * +-------------+------------------+-----------------+
 * |   ... tuple3 tuple2 tuple1 | "special space" |
 * +--------------------------------+-----------------+
 * ^ pd_special
 *
 * a page is full when nothing can be added between pd_lower and
 * pd_upper
 */

其中PageHeaderData结构如下：

typedef struct PageHeaderData
{
 /* XXX LSN is member of *any* block, not only page-organized ones */
 PageXLogRecPtr pd_lsn; * LSN: next byte after last byte of xlog
 * record for last change to this page */
 uint16 pd_checksum; * checksum */
 uint16 pd_flags; * flag bits, see below */
 LocationIndex pd_lower; * offset to start of free space */
 LocationIndex pd_upper; * offset to end of free space */
 LocationIndex pd_special; * offset to start of special space */
 uint16 pd_pagesize_version;
 TransactionId pd_prune_xid; * oldest prunable XID, or zero if none */
 ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; * line pointer array */
} PageHeaderData;

pd_lsn：为最近一次修改该page所生成的xlog record对应的LSN
pd_checksum：该page的checksum，用于正确性校验
pd_flags：标识该页面是否有剩余空间，以及page中的所有tuple是否对所有事务均可见
pd_lower，pd_upper：分别指向空闲空间的开始及结束位置
pd_special：在索引页中会用到该字段，指向特殊空间的起始位置
pd_prune_xid：该page中所有可被prune的tuple对应xid的最小值
pd_linp：对应行指针的数组，行指针line pointer对应的数据结构如下：

typedef struct ItemIdData
{
 unsigned lp_off:15, * offset to tuple (from start of page) */
 lp_flags:2, * state of line pointer, see below */
 lp_len:15; * byte length of tuple */
} ItemIdData;

/*
 * lp_flags has these possible states.  An UNUSED line pointer is available
 * for immediate re-use, the other states are not.
 */
#define LP_UNUSED 0 * unused (should always have lp_len=0) */
#define LP_NORMAL 1 * used (should always have lp_len>0) */
#define LP_REDIRECT 2 * HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 * dead, may or may not have storage */

lp_off：该line pointer指向的tuple在该page中的偏移
lp_flags：该行指针的状态：

LP_UNUSED：该行指针未被使用，在插入数据时可以被新元组复用；
LP_NORMAL：该行指针正常被使用
LP_REDIRECT：用于HOT链的重定位，对应的lp_off记录了下一个行指针对应的偏移
LP_DEAD：该行指针指向的元组为死元组，LP_DEAD类型的行指针不能被重用

lp_len：该行指针对应的tuple的长度

数据清理

了解了PostgreSQL中page的基本组织结构后，再看vacuum是如何对每个page进行清理的。对特定表的清理由函数vacuum_rel
完成，vacuum_rel
主要调用table_relation_vacuum
进行具体的清理工作，最终调用的函数为heap_vacuum_rel -》lazy_scan_heap
，主要过程如下：

依次扫描该表的每个page，根据VM bit判断是否可以跳过对该page的扫描，若标记为all visible/all frozen，说明该page中的tuple对所有事务均可见，无需进行清理；
通过lazy_scan_prune
获取对应page中需要清理的死元组，依据lazy_scan_prune
的结果判断是否需要设置或清理当前page的VISIBILITYMAP_ALL_VISIBLE及VISIBILITYMAP_ALL_FROZEN bit；
基于lazy_scan_prune
得到的死元组，通过lazy_vacuum
执行具体的清理工作。

lazy_scan_prune

lazy_scan_prune
调用heap_page_prune
进行实际的清理工作。如下，heap_page_prune
会扫描该page中的所有tuple，通过heap_prune_chain
遍历其HOT链，判断对应的tuple是否为dead tuple，并记录下需要设置为LP_UNUSED、LP_REDIRECT及LP_DEAD的line pointer；然后调用heap_page_prune_execute
处理这些元组。

int
heap_page_prune(Relation relation, Buffer buffer, ......)
{
 int ndeleted = 0;
 // 省略......

 /* Scan the page */
 for (offnum = FirstOffsetNumber;
 offnum <= maxoff;
 offnum = OffsetNumberNext(offnum))
 {
 ItemId itemid;

 /* Ignore items already processed as part of an earlier chain */
 if (prstate.marked[offnum])
 continue;

 /* see preceding loop */
 if (off_loc)
 *off_loc = offnum;

 /* Nothing to do if slot is empty or already dead */
 itemid = PageGetItemId(page, offnum);
 if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
 continue;

 // 记录需要设置为LP_UNUSED、LP_REDIRECT及LP_DEAD的item
        * Process this item or chain of items */
 ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 }

 /* Have we found any prunable items? */
 if (prstate.nredirected > 0 || prstate.ndead > 0 || prstate.nunused > 0)
 {
 /*
 * Apply the planned item changes, then repair page fragmentation, and
 * update the page's hint bit about whether it has free line pointers.
 */
   通过heap_page_prune_execute 批量处理这些items
        heap_page_prune_execute(buffer,
 prstate.redirected, prstate.nredirected,
 prstate.nowdead, prstate.ndead,
 prstate.nowunused, prstate.nunused);
        MarkBufferDirty(buffer);
         生成对应的wal日志，省略......
 }
     省略......
 return ndeleted;
}

heap_page_prune_execute
依据heap_prune_chain
得到的结果，将对应元组的line pointer分别设置为LP_UNUSED、LP_REDIRECT及LP_DEAD，然后调用PageRepairFragmentation
进行页面空间的整理。PageRepairFragmentation
记录该page内需要保留的元组，然后调用compactify_tuples
删除对应的tuple。

void
PageRepairFragmentation(Page page)
{
 Offset pd_lower = ((PageHeader) page)->pd_lower;
 Offset pd_upper = ((PageHeader) page)->pd_upper;
 Offset pd_special = ((PageHeader) page)->pd_special;
 // 省略......
 /*
 * Run through the line pointer array and collect data about live items.
 */
 nline = PageGetMaxOffsetNumber(page);
 itemidptr = itemidbase;
 nunused = totallen = 0;
 last_offset = pd_special;

     记录需要保留的tuple
    for (i = FirstOffsetNumber; i <= nline; i++)
 {
 lp = PageGetItemId(page, i);

         仅(itemId)->lp_flags != LP_UNUSED 且 (itemId)->lp_len != 0需要保留
        if (ItemIdIsUsed(lp))
 {
 if (ItemIdHasStorage(lp))
 {
 itemidptr->offsetindex = i - 1;
 itemidptr->itemoff = ItemIdGetOffset(lp);

 if (last_offset > itemidptr->itemoff)
 last_offset = itemidptr->itemoff;
 else
 presorted = false;

 if (unlikely(itemidptr->itemoff < (int) pd_upper ||
 itemidptr->itemoff >= (int) pd_special))
 ereport(ERROR,
 (errcode(ERRCODE_DATA_CORRUPTED),
 errmsg("corrupted line pointer: %u",
 itemidptr->itemoff)));
 itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
 totallen += itemidptr->alignedlen;
 itemidptr++;
 }
 }
 else
 {
 /* Unused entries should have lp_len = 0, but make sure */
 ItemIdSetUnused(lp);
 nunused++;
 }
 }

 nstorage = itemidptr - itemidbase;
 if (nstorage == 0)
 {
 /* Page is completely empty, so just reset it quickly */
 // 如果没有元组需要保留，则直接修改pd_upper指向page的最后
        ((PageHeader) page)->pd_upper = pd_special;
 }
 else
 {
 /* Need to compact the page the hard way */
 if (totallen > (Size) (pd_special - pd_lower))
 ereport(ERROR,
 (errcode(ERRCODE_DATA_CORRUPTED),
 errmsg("corrupted item lengths: total %u, available space %u",
 (unsigned int) totallen, pd_special - pd_lower)));

 // 进行实际的元组删除操作，将需要保留的tuple按序移动至page的末尾
        compactify_tuples(itemidbase, nstorage, page, presorted);
 }
     省略......
}

值得注意的是，虽然heap_page_prune
已经删除了LP_UNUSED、LP_REDIRECT、LP_DEAD对应的tuple实体，但LP_REDIRECT、LP_DEAD类型的line pointer依旧保留，并没有被设置为LP_UNUSED，因此还不能被重用。lazy_scan_prune
完成对page的空间整理后，会记录当前被标记为LP_DEAD的所有item，当通过lazy_vacuum
删除这些死元组相关的索引项之后，才会将LP_DEAD类型的line pointer置为LP_UNUSED，此后该line pointer才能被重用于插入新的元组。

lazy_vacuum

lazy_vacuum
根据lazy_scan_prune
得到的死元组记录，执行对应的清理工作，主要包括：

通过lazy_vacuum_all_indexes
删除死元组对应的索引元组；
通过lazy_vacuum_heap_rel
将死元组的line pointer标识由LP_DEAD修改为LP_UNUSED，后续该line pointer即可被重用。

其中lazy_vacuum_all_indexes
对该heap表的每一个索引表，调用lazy_vacuum_one_index
对索引进行清理，lazy_vacuum_one_index
调用index_bulk_delete
，以btree为例，调用函数btbulkdelete -》btvacuumscan
，btvacuumscan
对索引的每一个page，调用btvacuumpage
进行page的清理。btvacuumpage
遍历索引页中的每一项，通过callback判断该项是否需要删除，callback指向lazy_tid_reaped
，lazy_tid_reaped
判断index tuple对应的tid是否存在于lazy_scan_prune
记录的dead tuple中，若存在则说明该项需要被删除，将其记录至deletable数组中，然后调用_bt_delitems_vacuum
删除deletable数组中记录的索引项。

static void
btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
{
 IndexVacuumInfo *info = vstate->info;
 IndexBulkDeleteResult *stats = vstate->stats;
 IndexBulkDeleteCallback callback = vstate->callback;
 void   *callback_state = vstate->callback_state;
 Relation rel = info->index;
 // 省略......

backtrack:
     省略......
 else if (P_ISLEAF(opaque))
 {
 OffsetNumber deletable[MaxIndexTuplesPerPage];
 int ndeletable;
         省略......

        if (callback)
 {
 // 依次扫描index page中的每一项
            for (offnum = minoff;
 offnum <= maxoff;
 offnum = OffsetNumberNext(offnum))
 {
 IndexTuple itup;

 itup = (IndexTuple) PageGetItem(page,
 PageGetItemId(page, offnum));
 Assert(!BTreeTupleIsPivot(itup));
 if (!BTreeTupleIsPosting(itup))
 {
 // 基于callback判断是否需要删除，若需要则记录至deletable数组中
                    * Regular tuple, standard table TID representation */
 if (callback(&itup->t_tid, callback_state))
 {
 deletable[ndeletable++] = offnum;
 nhtidsdead++;
 }
 else
 nhtidslive++;
 }
 // 省略......
 }
 }

 /*
 * Apply any needed deletes or updates.  We issue just one
 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 */
 if (ndeletable > 0 || nupdatable > 0)
 {
 // 调用_bt_delitems_vacuum批量删除
            Assert(nhtidsdead >= ndeletable + nupdatable);
 _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
 nupdatable);

 stats->tuples_removed += nhtidsdead;
 /* must recompute maxoff */
 maxoff = PageGetMaxOffsetNumber(page);

 /* can't leak memory here */
 for (int i = 0; i < nupdatable; i++)
 pfree(updatable[i]);
 }
 // 省略......
 }
     省略......
}

_bt_delitems_vacuum
调用PageIndexMultiDelete
进行删除，PageIndexMultiDelete
跳过需要删除的index line pointer，获取需要保留的index line pointer，将需要的index line pointer直接拷贝至索引page的pd_linp处，实现对index line pointer的清理。然后调用compactify_tuples
实现对index tuple的清理。

void
PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
{
 PageHeader phdr = (PageHeader) page;
 Offset pd_lower = phdr->pd_lower;
 Offset pd_upper = phdr->pd_upper;
 Offset pd_special = phdr->pd_special;
 Offset last_offset;
 // 省略......
 /*
 * Scan the line pointer array and build a list of just the ones we are
 * going to keep.  Notice we do not modify the page yet, since we are
 * still validity-checking.
 */
 nline = PageGetMaxOffsetNumber(page);
 itemidptr = itemidbase;
 totallen = 0;
 nused = 0;
 nextitm = 0;
 last_offset = pd_special;
 // 扫描每一项
    for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
 {
 lp = PageGetItemId(page, offnum);
 Assert(ItemIdHasStorage(lp));
 size = ItemIdGetLength(lp);
 offset = ItemIdGetOffset(lp);
         省略......
 // 跳过需要删除的item，保留需要的item
        if (nextitm < nitems && offnum == itemnos[nextitm])
 {
 /* skip item to be deleted */
 nextitm++;
 }
 else
 {
 itemidptr->offsetindex = nused; * where it will go */
 itemidptr->itemoff = offset;

 if (last_offset > itemidptr->itemoff)
 last_offset = itemidptr->itemoff;
 else
 presorted = false;

 itemidptr->alignedlen = MAXALIGN(size);
 totallen += itemidptr->alignedlen;
 newitemids[nused] = *lp;
 itemidptr++;
 nused++;
 }
 }
     省略......
 /*
 * Looks good. Overwrite the line pointers with the copy, from which we've
 * removed all the unused items.
 */
 // 将需要的items拷贝到pd_linp处
    memcpy(phdr->pd_linp, newitemids, nused * sizeof(ItemIdData));
 phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);

     通过compactify_tuples进行index tuple的清理
 /* and compactify the tuple data */
 if (nused > 0)
 compactify_tuples(itemidbase, nused, page, presorted);
 else
 phdr->pd_upper = pd_special;
}

lazy_vacuum_all_indexes
完成对所有index tuple的清理之后，lazy_vacuum_heap_rel
进行最后的heap清理工作，lazy_vacuum_heap_rel
调用lazy_vacuum_heap_page
，将该page中死元组的line pointer由LP_DEAD修改为LP_UNUSED，此后该line pointer即可在插入新元组时被复用，并尝试删除line pointer数组末尾的unused line pointer。

static int
lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
  int tupindex, Buffer *vmbuffer)
{
 LVDeadTuples *dead_tuples = vacrel->dead_tuples;
 // 省略......
    // 遍历该page中的dead tuple
 for (; tupindex < dead_tuples->num_tuples; tupindex++)
 {
 BlockNumber tblk;
 OffsetNumber toff;
 ItemId itemid;

 tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
 if (tblk != blkno)
 break; /* past end of tuples for this block */
 toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 itemid = PageGetItemId(page, toff);

        // 将LP_DEAD的item修改为LP_UNUSED
 Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
        ItemIdSetUnused(itemid);
 unused[uncnt++] = toff;
 }

 Assert(uncnt > 0);

 /* Attempt to truncate line pointer array now */
 PageTruncateLinePointerArray(page);

 /*
 * Mark buffer dirty before we write WAL.
 */
 MarkBufferDirty(buffer);
    // 生成wal日志，省略......
}

总结

本文主要介绍了PostgreSQL VACUUM时对死元组的清理过程，包括对heap tuple的清理及对相关index tuple的清理，为了避免索引项指向一个不相关的heap tuple，必须在完成相关index tuple的清理之后，才可将对应的heap tuple设置为LP_UNUSED。除了清理死元组，VACUUM时还会进行事务xid的freeze，设置page的VM标识和FSM标识，尝试truncate表末尾的空页等，在后续文章中会对这部分清理工作进行详细分析。

文章转载自PolarDB，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PolarDB PostgreSQL vacuum数据清理