8_PostgreSQL XLOG Record内部结构解析_Non-Backup Block
内容概述
PG WAL中的XLOG Record记录了数据库的变更日志信息,类似Oracle中Redo Log的VC(change vector),XLOG Record由header portion 和data portion两部分组成,data portion由header part和data part两部分组成,由于full_page_writes特性,data portion的类型可分为 Backup Block、Non-Backup Block和其它。本文对Non-Backup Block结构进行详细解析。
XLOG Record结构说明
XLOG Record组成
XLOG Record由header portion 和data portion两部分组成。

data portion组成
data portion由header part和data part两部分组成

data portion类型
data portion的类型可分为 Backup Block、Non-Backup Block和其它

测试脚本
### step 1. 创建table t1切换日志并查询当前日志
drop table t1;
create table t1(a1 varchar);
insert into t1 values(33333333);
checkpoint;
select pg_switch_wal();
SELECT pg_walfile_name(pg_current_wal_lsn());
### step 2. 插入一条记录,并生成新的日志文件
insert into t1 values(66666666);
insert into t1 values(99999999);
select pg_switch_wal();
SELECT pg_walfile_name(pg_current_wal_lsn());
postgres=# SELECT pg_walfile_name(pg_current_wal_lsn());
pg_walfile_name
--------------------------
00000001000000000000000F
(1 row)
postgres=#
select * from t1;
postgres=# select pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/5/32783
(1 row)
### step 3. dump日志二进制格式
[postgres@enmo pgwal]$ dd if=00000001000000000000000F bs=8192 skip=0 seek=0 count=1 |hexdump -C
00000000 10 d1 06 00 01 00 00 00 00 00 00 0f 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 c3 16 52 34 5f 3e 59 63 |..........R4_>Yc|
00000020 00 00 00 01 00 20 00 00 32 00 00 00 00 00 00 00 |..... ..2.......|
00000030 c8 ad 01 0e 00 00 00 00 10 08 00 00 67 f8 a5 5d |............g..]|
00000040 ff 18 00 00 00 00 00 00 00 00 00 b3 d3 52 f7 02 |.............R..|
00000050 00 00 f7 02 00 00 f6 02 00 00 00 00 00 00 00 00 |................|
00000060 a6 00 00 00 f7 02 00 00 28 00 00 0f 00 00 00 00 |........(.......|
00000070 00 0a 00 00 c2 0a aa c2 00 10 00 00 70 00 20 00 |............p. .|
00000080 03 7f 06 00 00 05 00 00 00 0f 80 00 00 00 00 00 |................|
00000090 00 ff 03 00 00 00 00 b8 ac 01 0e 00 00 00 00 20 |............... |
000000a0 00 b0 1f 00 20 04 20 00 00 00 00 d8 9f 42 00 b0 |.... . ......B..|
000000b0 9f 42 00 f7 02 00 00 00 00 00 00 00 00 00 00 00 |.B..............|
000000c0 00 00 00 02 00 01 00 02 08 18 00 13 36 36 36 36 |............6666|
000000d0 36 36 36 36 00 00 00 00 00 00 00 f6 02 00 00 00 |6666............|
000000e0 00 00 00 00 00 00 00 00 00 00 00 01 00 01 00 02 |................|
000000f0 08 18 00 13 33 33 33 33 33 33 33 33 00 00 00 00 |....33333333....|
00000100 00 00 00 02 00 00 00 00 22 00 00 00 f7 02 00 00 |........".......|
00000110 60 00 00 0f 00 00 00 00 00 01 00 00 0a 44 6b 1f |`............Dk.|
00000120 ff 08 33 bc 06 df 12 8f 02 00 00 00 00 00 00 00 |..3.............|
00000130 40 00 00 00 f8 02 00 00 08 01 00 0f 00 00 00 00 |@...............|
00000140 00 0a 00 00 ed 2a 59 40 00 20 0f 00 7f 06 00 00 |.....*Y@. ......|
00000150 05 00 00 00 0f 80 00 00 00 00 00 00 ff 03 01 00 |................|
00000160 02 08 18 00 13 39 39 39 39 39 39 39 39 03 00 00 |.....99999999...|
00000170 22 00 00 00 f8 02 00 00 30 01 00 0f 00 00 00 00 |".......0.......|
00000180 00 01 00 00 4d 92 8c cb ff 08 6a f7 06 df 12 8f |....M.....j.....|
00000190 02 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 |................|
000001a0 70 01 00 0f 00 00 00 00 40 00 00 00 36 e1 24 75 |p.......@...6.$u|
000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002000
[postgres@enmo pgwal]$
### step 4. pg_waldump日志信息
[postgres@enmo pgwal]$ pg_waldump 00000001000000000000000F
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0F000028, prev 0/0E01ADC8, desc: RUNNING_XACTS nextXid 759 latestCompletedXid 758 oldestRunningXid 759
rmgr: Heap len (rec/tot): 54/ 166, tx: 759, lsn: 0/0F000060, prev 0/0F000028, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/5/32783 blk 0 FPW
rmgr: Transaction len (rec/tot): 34/ 34, tx: 759, lsn: 0/0F000108, prev 0/0F000060, desc: COMMIT 2022-10-28 16:32:47.365171 CST
rmgr: Heap len (rec/tot): 64/ 64, tx: 760, lsn: 0/0F000130, prev 0/0F000108, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/5/32783 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 760, lsn: 0/0F000170, prev 0/0F000130, desc: COMMIT 2022-10-28 16:32:47.380330 CST
rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 0/0F000198, prev 0/0F000170, desc: SWITCH
[postgres@enmo pgwal]$
### step 5. 安装pageinspect扩展
cd /soft/postgresql-15beta1/contrib/pageinspect
postgres=# create extension pageinspect;
CREATE EXTENSION
postgres=#
postgres=# select * from heap_page_items(get_raw_page('t1', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------
1 | 8152 | 1 | 33 | 758 | 0 | 0 | (0,1) | 1 | 2050 | 24 | | | \x133333333333333333
2 | 8112 | 1 | 33 | 759 | 0 | 0 | (0,2) | 1 | 2050 | 24 | | | \x133636363636363636
3 | 8072 | 1 | 33 | 760 | 0 | 0 | (0,3) | 1 | 2050 | 24 | | | \x133939393939393939
(3 rows)
postgres=#
### step 6. 查询表空间信息
postgres=# select * from pg_tablespace;
oid | spcname | spcowner | spcacl | spcoptions
------+------------+----------+--------+------------
1663 | pg_default | 10 | |
1664 | pg_global | 10 | |
(2 rows)
postgres=#
WAL结构解析
第一部分为XLogLongPageHeaderData结构,上篇文章已解析过,后面6个XLogRecord结构,我们只解析2个insert操作,分别对应data portion类型的 Backup Block和Non-Backup Block。

Non-Backup Block 结构解析
insert语句产生的non-backup由以下4个数据结构体和1个数据对象组成,
- 结构体header-portion of XLogRecord
- 结构体XLogRecordBlockHeader
- 结构体XLogRecordDataHeaderShort
- an inserted tuple (to be exact, a xl_heap_header structure and an inserted data entire)
- 结构体xl_heap_insert

解析header-portion of XLogRecord
结构体定义
/*
* The overall layout of an XLOG record is:
* Fixed-size header (XLogRecord struct)
* XLogRecordBlockHeader struct
* XLogRecordBlockHeader struct
* ...
* XLogRecordDataHeader[Short|Long] struct
* block data
* block data
* ...
* main data
*
* There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
* rmgr-specific data not associated with a block. XLogRecord structs
* always start on MAXALIGN boundaries in the WAL files, but the rest of
* the fields are not aligned.
*
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
*/
typedef struct XLogRecord <--大小:24 bytes
{
uint32 xl_tot_len; /* total len of entire record */ --typedef unsigned int uint32; /* == 32 bits */
TransactionId xl_xid; /* xact id */ --typedef uint32 TransactionId;
XLogRecPtr xl_prev; /* ptr to previous record in log */--typedef uint64 XLogRecPtr;
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */--typedef uint8 RmgrId;
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */--typedef uint32 pg_crc32c;
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
XLogRecord结构体:./src/include/access/xlogrecord.h
二进制解析
header-portion of XLogRecord由前24bytes组成

uint32 xl_tot_len: 40 00 00 00
TransactionId xl_xid: f8 02 00 00
XLogRecPtr xl_prev: 08 01 00 0f 00 00 00 00
uint8 xl_info: 00
RmgrId xl_rmid: 0a
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc: ed 2a 59 40
解析XLogRecordBlockHeader of data portion
结构体定义
### XLogRecordBlockHeader
/*
* Header info for block data appended to an XLOG record.
*
* 'data_length' is the length of the rmgr-specific payload data associated
* with this block. It does not include the possible full page image, nor
* XLogRecordBlockHeader struct itself.
*
* Note that we don't attempt to align the XLogRecordBlockHeader struct!
* So, the struct must be copied to aligned local storage before use.
*/
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including page
* image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;
#define SizeOfXLogRecordBlockHeader (offsetof(XLogRecordBlockHeader, data_length) + sizeof(uint16))
### RelFileNode
/*
* RelFileNode must provide all that we need to know to physically access
* a relation, with the exception of the backend ID, which can be provided
* separately. Note, however, that a "physical" relation is comprised of
* multiple files on the filesystem, as each fork is stored as a separate
* file, and each fork can be divided into multiple segments. See md.c.
*
* spcNode identifies the tablespace of the relation. It corresponds to
* pg_tablespace.oid.
*
* dbNode identifies the database of the relation. It is zero for
* "shared" relations (those common to all databases of a cluster).
* Nonzero dbNode values correspond to pg_database.oid.
*
* relNode identifies the specific relation. relNode corresponds to
* pg_class.relfilenode (NOT pg_class.oid, because we need to be able
* to assign new physical files to relations in some situations).
* Notice that relNode is only unique within a database in a particular
* tablespace.
*
* Note: spcNode must be GLOBALTABLESPACE_OID if and only if dbNode is
* zero. We support shared relations only in the "global" tablespace.
*
* Note: in pg_class we allow reltablespace == 0 to denote that the
* relation is stored in its database's "default" tablespace (as
* identified by pg_database.dattablespace). However this shorthand
* is NOT allowed in RelFileNode structs --- the real tablespace ID
* must be supplied when setting spcNode.
*
* Note: in pg_class, relfilenode can be zero to denote that the relation
* is a "mapped" relation, whose current true filenode number is available
* from relmapper.c. Again, this case is NOT allowed in RelFileNodes.
*
* Note: various places use RelFileNode in hashtable keys. Therefore,
* there *must not* be any unused padding bytes in this struct. That
* should be safe as long as all the fields are of type Oid.
*/
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */ --typedef unsigned int Oid;
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;
### BlockNumber
typedef uint32 BlockNumber;
XLogRecordBlockHeader结构体:./src/include/access/xlogrecord.h
二进制解析
XLogRecordBlockHeader由header-portion of XLogRecord后面的20个bytes组成。

uint8 id: 00 uint8 fork_flags: 20 uint16 data_length: 0f 00 RelFileNode: rel 1663/5/32783 Oid spcNode: 7f 06 00 00 Oid dbNode: 05 00 00 00 Oid relNode: 0f 80 00 00 BlockNumber: blk 0 <-- 000000
解析XLogRecordDataHeaderShort
结构体定义
/*
* XLogRecordDataHeaderShort/Long are used for the "main data" portion of
* the record. If the length of the data is less than 256 bytes, the short
* form is used, with a single byte to hold the length. Otherwise the long
* form is used.
*
* (These structs are currently not used in the code, they are here just for
* documentation purposes).
*/
typedef struct XLogRecordDataHeaderShort
{
uint8 id; /* XLR_BLOCK_ID_DATA_SHORT */
uint8 data_length; /* number of payload bytes */
} XLogRecordDataHeaderShort;
#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)
typedef struct XLogRecordDataHeaderLong
{
uint8 id; /* XLR_BLOCK_ID_DATA_LONG */
/* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;
#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))
/*
* Block IDs used to distinguish different kinds of record fragments. Block
* references are numbered from 0 to XLR_MAX_BLOCK_ID. A rmgr is free to use
* any ID number in that range (although you should stick to small numbers,
* because the WAL machinery is optimized for that case). A few ID
* numbers are reserved to denote the "main" data portion of the record,
* as well as replication-supporting transaction metadata.
*
* The maximum is currently set at 32, quite arbitrarily. Most records only
* need a handful of block references, but there are a few exceptions that
* need more.
*/
#define XLR_MAX_BLOCK_ID 32
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#define XLR_BLOCK_ID_TOPLEVEL_XID 252
二进制解析
XLogRecordDataHeaderShort由2 bytes组成

二进制解析
uint8 id: ff uint8 data_length: 03
解析xl_heap_header
结构体定义
/*
* We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
* or updated tuple in WAL; we can save a few bytes by reconstructing the
* fields that are available elsewhere in the WAL record, or perhaps just
* plain needn't be reconstructed. These are the fields we must store.
*/
typedef struct xl_heap_header
{
uint16 t_infomask2;
uint16 t_infomask;
uint8 t_hoff;
} xl_heap_header;
#define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8))
/* This is what we need to know about insert */
typedef struct xl_heap_insert
{
OffsetNumber offnum; /* inserted tuple's offset */ --typedef uint16 OffsetNumber;
uint8 flags;
/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
#define SizeOfHeapInsert (offsetof(xl_heap_insert, flags) + sizeof(uint8))
二进制解析

二进制解析
typedef struct xl_heap_header
{
uint16 t_infomask2; -- 01 00
uint16 t_infomask; -- 02 08
uint8 t_hoff; -- 18
} xl_heap_header;
typedef struct xl_heap_insert
{
OffsetNumber offnum; --03 00 * inserted tuple's offset */ --typedef uint16 OffsetNumber;
uint8 flags; --00
/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
最后修改时间:2022-11-02 19:48:38
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




