PolarDB PostgreSQL vacuum数据清理
关于 PolarDB PostgreSQL 版
PolarDB PostgreSQL 版是一款阿里云自主研发的云原生关系型数据库产品,100% 兼容 PostgreSQL,高度兼容Oracle语法;采用基于 Shared-Storage 的存储计算分离架构,具有极致弹性、毫秒级延迟、HTAP 、Ganos全空间数据处理能力和高可靠、高可用、弹性扩展等企业级数据库特性。同时,PolarDB PostgreSQL 版具有大规模并行计算能力,可以应对 OLTP 与 OLAP 混合负载。
VACUUM
PostgreSQL通过MVCC多版本实现并发控制,每个写操作都会创建一个新版本的数据项,并保留其旧版本。当事务读取数据对象时,系统会选择基于该事务快照可见的一个版本,通过这种方式来确保各个事务间的相互隔离。当旧版本数据项对系统中的所有事务均不可见时,vacuum机制会对这些死元组进行清理,从而避免数据膨胀。对特定表的自动清理由autovacuum worker进行,清理过程主要包括:
移除死元组:移除每个page中的dead tuple及指向dead tuple的index tuple;
freeze旧的事务标识:freeze旧元组的事务标识,用于防止xid回卷;
更新表的空闲空间映射FSM和可见性映射VM,判断是否需要truncate heap最后的空页,根据freeze后得到的datfrozenxid删除不再需要的clog等事务文件。
以下对第一部分清理死元组的过程进行详细介绍。
Page结构
PostgreSQL中每个表对应page的结构如下所示,由PageHeaderData + line pointer + tuple + special space四部分组成
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
*
* specifically, while a disk block can be unformatted, a postgres
* disk page is always a slotted page of the form:
*
* +----------------+---------------------------------+
* | PageHeaderData | linp1 linp2 linp3 ... |
* +-----------+----+---------------------------------+
* | ... linpN | |
* +-----------+--------------------------------------+
* | ^ pd_lower |
* | |
* | v pd_upper |
* +-------------+------------------------------------+
* | | tupleN ... |
* +-------------+------------------+-----------------+
* | ... tuple3 tuple2 tuple1 | "special space" |
* +--------------------------------+-----------------+
* ^ pd_special
*
* a page is full when nothing can be added between pd_lower and
* pd_upper
*/
其中PageHeaderData结构如下:
typedef struct PageHeaderData
{
/* XXX LSN is member of *any* block, not only page-organized ones */
PageXLogRecPtr pd_lsn; * LSN: next byte after last byte of xlog
* record for last change to this page */
uint16 pd_checksum; * checksum */
uint16 pd_flags; * flag bits, see below */
LocationIndex pd_lower; * offset to start of free space */
LocationIndex pd_upper; * offset to end of free space */
LocationIndex pd_special; * offset to start of special space */
uint16 pd_pagesize_version;
TransactionId pd_prune_xid; * oldest prunable XID, or zero if none */
ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; * line pointer array */
} PageHeaderData;
pd_lsn:为最近一次修改该page所生成的xlog record对应的LSN
pd_checksum:该page的checksum,用于正确性校验
pd_flags:标识该页面是否有剩余空间,以及page中的所有tuple是否对所有事务均可见
pd_lower,pd_upper:分别指向空闲空间的开始及结束位置
pd_special:在索引页中会用到该字段,指向特殊空间的起始位置
pd_prune_xid:该page中所有可被prune的tuple对应xid的最小值
pd_linp:对应行指针的数组,行指针line pointer对应的数据结构如下:
typedef struct ItemIdData
{
unsigned lp_off:15, * offset to tuple (from start of page) */
lp_flags:2, * state of line pointer, see below */
lp_len:15; * byte length of tuple */
} ItemIdData;
/*
* lp_flags has these possible states. An UNUSED line pointer is available
* for immediate re-use, the other states are not.
*/
#define LP_UNUSED 0 * unused (should always have lp_len=0) */
#define LP_NORMAL 1 * used (should always have lp_len>0) */
#define LP_REDIRECT 2 * HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 * dead, may or may not have storage */
lp_off:该line pointer指向的tuple在该page中的偏移
lp_flags:该行指针的状态:
LP_UNUSED:该行指针未被使用,在插入数据时可以被新元组复用;
LP_NORMAL:该行指针正常被使用
LP_REDIRECT:用于HOT链的重定位,对应的lp_off记录了下一个行指针对应的偏移
LP_DEAD:该行指针指向的元组为死元组,LP_DEAD类型的行指针不能被重用
lp_len:该行指针对应的tuple的长度
数据清理
了解了PostgreSQL中page的基本组织结构后,再看vacuum是如何对每个page进行清理的。对特定表的清理由函数vacuum_rel
完成,vacuum_rel
主要调用table_relation_vacuum
进行具体的清理工作,最终调用的函数为heap_vacuum_rel -》lazy_scan_heap
,主要过程如下:
依次扫描该表的每个page,根据VM bit判断是否可以跳过对该page的扫描,若标记为all visible/all frozen,说明该page中的tuple对所有事务均可见,无需进行清理;
通过
lazy_scan_prune
获取对应page中需要清理的死元组,依据lazy_scan_prune
的结果判断是否需要设置或清理 当前page的VISIBILITYMAP_ALL_VISIBLE及VISIBILITYMAP_ALL_FROZEN bit;基于
lazy_scan_prune
得到的死元组,通过lazy_vacuum
执行具体的清理工作。
lazy_scan_prune
lazy_scan_prune
调用heap_page_prune
进行实际的清理工作。如下,heap_page_prune
会扫描该page中的所有tuple,通过heap_prune_chain
遍历其HOT链,判断对应的tuple是否为dead tuple,并记录下需要设置为LP_UNUSED、LP_REDIRECT及LP_DEAD的line pointer;然后调用heap_page_prune_execute
处理这些元组。
int
heap_page_prune(Relation relation, Buffer buffer, ......)
{
int ndeleted = 0;
// 省略......
/* Scan the page */
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
/* Ignore items already processed as part of an earlier chain */
if (prstate.marked[offnum])
continue;
/* see preceding loop */
if (off_loc)
*off_loc = offnum;
/* Nothing to do if slot is empty or already dead */
itemid = PageGetItemId(page, offnum);
if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
continue;
// 记录需要设置为LP_UNUSED、LP_REDIRECT及LP_DEAD的item
* Process this item or chain of items */
ndeleted += heap_prune_chain(buffer, offnum, &prstate);
}
/* Have we found any prunable items? */
if (prstate.nredirected > 0 || prstate.ndead > 0 || prstate.nunused > 0)
{
/*
* Apply the planned item changes, then repair page fragmentation, and
* update the page's hint bit about whether it has free line pointers.
*/
通过heap_page_prune_execute 批量处理这些items
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);
MarkBufferDirty(buffer);
生成对应的wal日志,省略......
}
省略......
return ndeleted;
}
heap_page_prune_execute
依据heap_prune_chain
得到的结果,将对应元组的line pointer分别设置为LP_UNUSED、LP_REDIRECT及LP_DEAD,然后调用PageRepairFragmentation
进行页面空间的整理。PageRepairFragmentation
记录该page内需要保留的元组,然后调用compactify_tuples
删除对应的tuple。
void
PageRepairFragmentation(Page page)
{
Offset pd_lower = ((PageHeader) page)->pd_lower;
Offset pd_upper = ((PageHeader) page)->pd_upper;
Offset pd_special = ((PageHeader) page)->pd_special;
// 省略......
/*
* Run through the line pointer array and collect data about live items.
*/
nline = PageGetMaxOffsetNumber(page);
itemidptr = itemidbase;
nunused = totallen = 0;
last_offset = pd_special;
记录需要保留的tuple
for (i = FirstOffsetNumber; i <= nline; i++)
{
lp = PageGetItemId(page, i);
仅(itemId)->lp_flags != LP_UNUSED 且 (itemId)->lp_len != 0需要保留
if (ItemIdIsUsed(lp))
{
if (ItemIdHasStorage(lp))
{
itemidptr->offsetindex = i - 1;
itemidptr->itemoff = ItemIdGetOffset(lp);
if (last_offset > itemidptr->itemoff)
last_offset = itemidptr->itemoff;
else
presorted = false;
if (unlikely(itemidptr->itemoff < (int) pd_upper ||
itemidptr->itemoff >= (int) pd_special))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted line pointer: %u",
itemidptr->itemoff)));
itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
totallen += itemidptr->alignedlen;
itemidptr++;
}
}
else
{
/* Unused entries should have lp_len = 0, but make sure */
ItemIdSetUnused(lp);
nunused++;
}
}
nstorage = itemidptr - itemidbase;
if (nstorage == 0)
{
/* Page is completely empty, so just reset it quickly */
// 如果没有元组需要保留,则直接修改pd_upper指向page的最后
((PageHeader) page)->pd_upper = pd_special;
}
else
{
/* Need to compact the page the hard way */
if (totallen > (Size) (pd_special - pd_lower))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted item lengths: total %u, available space %u",
(unsigned int) totallen, pd_special - pd_lower)));
// 进行实际的元组删除操作,将需要保留的tuple按序移动至page的末尾
compactify_tuples(itemidbase, nstorage, page, presorted);
}
省略......
}
值得注意的是,虽然heap_page_prune
已经删除了LP_UNUSED、LP_REDIRECT、LP_DEAD对应的tuple实体,但LP_REDIRECT、LP_DEAD类型的line pointer依旧保留,并没有被设置为LP_UNUSED,因此还不能被重用。lazy_scan_prune
完成对page的空间整理后,会记录当前被标记为LP_DEAD的所有item,当通过lazy_vacuum
删除这些死元组相关的索引项之后,才会将LP_DEAD类型的line pointer置为LP_UNUSED,此后该line pointer才能被重用于插入新的元组。
lazy_vacuum
lazy_vacuum
根据lazy_scan_prune
得到的死元组记录,执行对应的清理工作,主要包括:
通过
lazy_vacuum_all_indexes
删除死元组对应的索引元组;通过
lazy_vacuum_heap_rel
将死元组的line pointer标识由LP_DEAD修改为LP_UNUSED,后续该line pointer即可被重用。
其中lazy_vacuum_all_indexes
对该heap表的每一个索引表,调用lazy_vacuum_one_index
对索引进行清理,lazy_vacuum_one_index
调用index_bulk_delete
,以btree为例,调用函数btbulkdelete -》btvacuumscan
,btvacuumscan
对索引的每一个page,调用btvacuumpage
进行page的清理。btvacuumpage
遍历索引页中的每一项,通过callback判断该项是否需要删除,callback指向lazy_tid_reaped
,lazy_tid_reaped
判断index tuple对应的tid是否存在于lazy_scan_prune
记录的dead tuple中,若存在则说明该项需要被删除,将其记录至deletable数组中,然后调用_bt_delitems_vacuum
删除deletable数组中记录的索引项。
static void
btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
// 省略......
backtrack:
省略......
else if (P_ISLEAF(opaque))
{
OffsetNumber deletable[MaxIndexTuplesPerPage];
int ndeletable;
省略......
if (callback)
{
// 依次扫描index page中的每一项
for (offnum = minoff;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
Assert(!BTreeTupleIsPivot(itup));
if (!BTreeTupleIsPosting(itup))
{
// 基于callback判断是否需要删除,若需要则记录至deletable数组中
* Regular tuple, standard table TID representation */
if (callback(&itup->t_tid, callback_state))
{
deletable[ndeletable++] = offnum;
nhtidsdead++;
}
else
nhtidslive++;
}
// 省略......
}
}
/*
* Apply any needed deletes or updates. We issue just one
* _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
*/
if (ndeletable > 0 || nupdatable > 0)
{
// 调用_bt_delitems_vacuum批量删除
Assert(nhtidsdead >= ndeletable + nupdatable);
_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
nupdatable);
stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
/* can't leak memory here */
for (int i = 0; i < nupdatable; i++)
pfree(updatable[i]);
}
// 省略......
}
省略......
}
_bt_delitems_vacuum
调用PageIndexMultiDelete
进行删除,PageIndexMultiDelete
跳过需要删除的index line pointer,获取需要保留的index line pointer,将需要的index line pointer直接拷贝至索引page的pd_linp处,实现对index line pointer的清理。然后调用compactify_tuples
实现对index tuple的清理。
void
PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
{
PageHeader phdr = (PageHeader) page;
Offset pd_lower = phdr->pd_lower;
Offset pd_upper = phdr->pd_upper;
Offset pd_special = phdr->pd_special;
Offset last_offset;
// 省略......
/*
* Scan the line pointer array and build a list of just the ones we are
* going to keep. Notice we do not modify the page yet, since we are
* still validity-checking.
*/
nline = PageGetMaxOffsetNumber(page);
itemidptr = itemidbase;
totallen = 0;
nused = 0;
nextitm = 0;
last_offset = pd_special;
// 扫描每一项
for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
{
lp = PageGetItemId(page, offnum);
Assert(ItemIdHasStorage(lp));
size = ItemIdGetLength(lp);
offset = ItemIdGetOffset(lp);
省略......
// 跳过需要删除的item,保留需要的item
if (nextitm < nitems && offnum == itemnos[nextitm])
{
/* skip item to be deleted */
nextitm++;
}
else
{
itemidptr->offsetindex = nused; * where it will go */
itemidptr->itemoff = offset;
if (last_offset > itemidptr->itemoff)
last_offset = itemidptr->itemoff;
else
presorted = false;
itemidptr->alignedlen = MAXALIGN(size);
totallen += itemidptr->alignedlen;
newitemids[nused] = *lp;
itemidptr++;
nused++;
}
}
省略......
/*
* Looks good. Overwrite the line pointers with the copy, from which we've
* removed all the unused items.
*/
// 将需要的items拷贝到pd_linp处
memcpy(phdr->pd_linp, newitemids, nused * sizeof(ItemIdData));
phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);
通过compactify_tuples进行index tuple的清理
/* and compactify the tuple data */
if (nused > 0)
compactify_tuples(itemidbase, nused, page, presorted);
else
phdr->pd_upper = pd_special;
}
lazy_vacuum_all_indexes
完成对所有index tuple的清理之后,lazy_vacuum_heap_rel
进行最后的heap清理工作,lazy_vacuum_heap_rel
调用lazy_vacuum_heap_page
,将该page中死元组的line pointer由LP_DEAD修改为LP_UNUSED,此后该line pointer即可在插入新元组时被复用,并尝试删除line pointer数组末尾的unused line pointer。
static int
lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
int tupindex, Buffer *vmbuffer)
{
LVDeadTuples *dead_tuples = vacrel->dead_tuples;
// 省略......
// 遍历该page中的dead tuple
for (; tupindex < dead_tuples->num_tuples; tupindex++)
{
BlockNumber tblk;
OffsetNumber toff;
ItemId itemid;
tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
if (tblk != blkno)
break; /* past end of tuples for this block */
toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
itemid = PageGetItemId(page, toff);
// 将LP_DEAD的item修改为LP_UNUSED
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
ItemIdSetUnused(itemid);
unused[uncnt++] = toff;
}
Assert(uncnt > 0);
/* Attempt to truncate line pointer array now */
PageTruncateLinePointerArray(page);
/*
* Mark buffer dirty before we write WAL.
*/
MarkBufferDirty(buffer);
// 生成wal日志,省略......
}
总结
本文主要介绍了PostgreSQL VACUUM时对死元组的清理过程,包括对heap tuple的清理及对相关index tuple的清理,为了避免索引项指向一个不相关的heap tuple,必须在完成相关index tuple的清理之后,才可将对应的heap tuple设置为LP_UNUSED。除了清理死元组,VACUUM时还会进行事务xid的freeze,设置page的VM标识和FSM标识,尝试truncate表末尾的空页等,在后续文章中会对这部分清理工作进行详细分析。




