Postgresql page 剖析

小陈的技术博客 2022-11-23

1006

Postgresql page 解析

一、 pg存储结构

在pg中，磁盘存储和内存存储的最小管理单位都是块。

保存在磁盘中的数据块称为page；内存中的数据块称为buffer；表和索引称为relation；行称为tuple。

默认情况下，pg数据块大小为8k，page是pg的最小存取单位。

1. pg数据页

数据页由页头（PageHeaderData）、数据指针数组（ItemIdData）、可使用的空闲空间（Free space）、实际数据（Items）和特殊空间（Special Space）组成。

1.1 页头存储了LSN号、校验位等元数据信息，占用了24Bytes。

1.2 数据指针数组存储指向实际数据的指针。数据中的元素ItemId可理解为相应数据行在Page中的实际开始偏移。数据行指针ItemId由3部分组成，前15位为page内偏移，中间2位为标志，后面15位为长度。

1.3 空闲空间未使用可分配的空间，ItemID从空闲空间的头部开始分配，Item（数据行）从空闲空间的尾部开始分配。

1.4 实际数据未数据的行数据。

1.5 特殊空间用于存储索引访问使用的数据，不同的访问方法数据不同。

2. 页头详解

2.1 页头结构体代码

less src/include/storage/bufpage.h
typedef struct PageHeaderData
{
        /* XXX LSN is member of *any* block, not only page-organized ones */
        PageXLogRecPtr pd_lsn;          /* LSN: next byte after last byte of xlog
                                                                 * record for last change to this page */
        uint16          pd_checksum;    /* checksum */
        uint16          pd_flags;               /* flag bits, see below */
        LocationIndex pd_lower;         /* offset to start of free space */
        LocationIndex pd_upper;         /* offset to end of free space */
        LocationIndex pd_special;       /* offset to start of special space */
        uint16          pd_pagesize_version;
        TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
        ItemIdData      pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

2.2 pg数据页存储剖析

简单介绍一下hexdump工具：

> hexdump 以 ASCII、十进制、十六进制或八进制显示文件内容。

> -s <offset>：跳过开头制定长度个字节

> -n <length>：只解析输入的指定长度个字节

psql -Upostgres
-- 创建测试表和数据
drop table if exists t_page;
create table t_page (id int,c1 char(8),c2 varchar(16));
insert into t_page values(1,'1','a');
insert into t_page values(2,'2','b');
insert into t_page values(3,'3','c');
insert into t_page values(4,'4','d');
-- 获取表对应的数据文件
postgres=# select pg_relation_filepath('t_page');
 pg_relation_filepath
----------------------
 base/5/16384

-- 查看表中的数据

$ hexdump -C $PGDATA/base/5/16384
00000000  00 00 00 00 70 07 50 01  00 00 00 00 28 00 60 1f  |....p.P.....(.`.|
00000010  00 20 04 20 00 00 00 00  d8 9f 4e 00 b0 9f 4e 00  |. . ......N...N.|
00000020  88 9f 4e 00 60 9f 4e 00  00 00 00 00 00 00 00 00  |..N.`.N.........|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001f60  d8 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f70  04 00 03 00 02 09 18 00  04 00 00 00 13 34 20 20  |.............4  |
00001f80  20 20 20 20 20 05 64 00  d7 02 00 00 00 00 00 00  |     .d.........|
00001f90  00 00 00 00 00 00 00 00  03 00 03 00 02 09 18 00  |................|
00001fa0  03 00 00 00 13 33 20 20  20 20 20 20 20 05 63 00  |.....3       .c.|
00001fb0  d6 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001fc0  02 00 03 00 02 09 18 00  02 00 00 00 13 32 20 20  |.............2  |
00001fd0  20 20 20 20 20 05 62 00  d5 02 00 00 00 00 00 00  |     .b.........|
00001fe0  00 00 00 00 00 00 00 00  01 00 03 00 02 09 18 00  |................|
00001ff0  01 00 00 00 13 31 20 20  20 20 20 20 20 05 61 00  |.....1       .a.|
00002000

2.2.1 pd_lsn组成解析：

$ hexdump -C $PGDATA/base/5/16384 -s 0 -n 8
00000000  00 00 00 00 70 07 50 01                           |....p.P.|
00000008
/** 数据文件的前8个bytes存储的是lsn。
其中最开始的4个bytes是TimelineID，在这里是\x0000 0000
组合起来LSN为0/1500770
**/

2.2.2 pd_checksum（2bytes）解析

$ hexdump -C $PGDATA/base/5/16384 -s 8 -n 2
00000008  00 00                                             |..|
0000000a
/**
checksum为\x0000
**/

2.2.3 pg_flag（2bytes）

$ hexdump -C $PGDATA/base/5/16384 -s 10 -n 2
0000000a  00 00                                             |..|
0000000c
/**
flags为\x0000
**/

2.2.4 pg_lower（2bytes）

$ hexdump -C $PGDATA/base/5/16384 -s 12 -n 2
0000000c  28 00                                             |(.|
0000000e
/**
lower为\x0028，十进制为40
**/

2.2.5 pg_upper（2bytes）

hexdump -C $PGDATA/base/5/16384 -s 14 -n 2
0000000e  60 1f                                             |`.|
00000010
/**
upper为\x1f60，十进制为8032
**/

2.2.6 pg_special（2bytes）

$ hexdump -C $PGDATA/base/5/16384 -s 16 -n 2
00000010  00 20                                             |. |
00000012
/**
upper为\x2000，十进制为8192
**/

2.2.7 pg_pagesize_version（2bytes）

$ hexdump -C $PGDATA/base/5/16384 -s 18 -n 2
00000012  04 20                                             |. |
00000014
/**
pagesize_version为\x2004，十进制为8196，即版本4
**/

2.2.8 pg_prune_xid（4bytes）

$ hexdump -C $PGDATA/base/5/16384 -s 20 -n 4
00000014  00 00 00 00                                       |....|
00000018
/**
prune_xid为\x0000，即0
**/

2.3 ItemIds结构体

less src/include/storage/itemid.h
typedef struct ItemIdData
{
        unsigned        lp_off:15,              /* offset to tuple (from start of page) */
                                lp_flags:2,             /* state of line pointer, see below */
                                lp_len:15;              /* byte length of tuple */
} ItemIdData;
typedef ItemIdData *ItemId;

PageHeaderData之后是ItemId数组，每个元素占用的空间是4bytes。

2.3.1 lp_off

$ hexdump -C $PGDATA/base/5/16384 -s 24 -n 2
00000018  d8 9f                                             |..|
0000001a
/**
取低15位
echo $((0x9fd8 & ~$((1<<15))))
8152
表示第一个Item（tuple）从8152开始
**/

2.3.2 lp_len

$ hexdump -C $PGDATA/base/5/16384 -s 26 -n 2
0000001a  4e 00                                             |N.|
0000001c
/**
取高15位
echo $((0x004e >> 1))
39
表示第一个Item（tuple）的大小为39
**/

2.3.3 lp_flags

取第17-16位，01，即1

2.4 Items（tuple）结构体

每个tuple包含两部分，第一部分是tuple头部信息，第二部分是实际的数据。

less src/include/storage/off.h
/*
* OffsetNumber:
*
* this is a 1-based index into the linp (ItemIdData) array in the
* header of each disk page.
*/
typedef uint16 OffsetNumber;
 
less src/include/storage/block.h
/*
* BlockId:
*
* this is a storage type for BlockNumber.  in other words, this type
* is used for on-disk structures (e.g., in HeapTupleData) whereas
* BlockNumber is the type on which calculations are performed (e.g.,
* in access method code).
*
* there doesn't appear to be any reason to have separate types except
* for the fact that BlockIds can be SHORTALIGN'd (and therefore any
* structures that contains them, such as ItemPointerData, can also be
* SHORTALIGN'd).  this is an important consideration for reducing the
* space requirements of the line pointer (ItemIdData) array on each
* page and the header of each heap or index tuple, so it doesn't seem
* wise to change this without good reason.
*/
typedef struct BlockIdData
{
    uint16      bi_hi;
    uint16      bi_lo;
} BlockIdData;
 
typedef BlockIdData *BlockId; /* block identifier */
 
less src/include/storage/itemptr.h
/*
  * ItemPointer:
  *
  * This is a pointer to an item within a disk page of a known file
  * (for example, a cross-link from an index to its parent table).
  * blkid tells us which block, posid tells us which entry in the linp
  * (ItemIdData) array we want.
  *
  * Note: because there is an item pointer in each tuple header and index
  * tuple header on disk, it's very important not to waste space with
  * structure padding bytes.  The struct is designed to be six bytes long
  * (it contains three int16 fields) but a few compilers will pad it to
  * eight bytes unless coerced.  We apply appropriate persuasion where
  * possible.  If your compiler can't be made to play along, you'll waste
  * lots of space.
  */
 typedef struct ItemPointerData
 {
     BlockIdData ip_blkid;
     OffsetNumber ip_posid;
 }
 
less src/include/access/htup_details.h
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId   t_cid;      /* inserting or deleting command ID, or both */
        TransactionId t_xvac;   /* old-style VACUUM FULL xact ID */
    }           t_field3;
} HeapTupleFields;
 
typedef struct DatumTupleFields
{
    int32       datum_len_;     /* varlena header (do not touch directly!) */
 
    int32       datum_typmod;   /* -1, or identifier of a record type */
 
    Oid         datum_typeid;   /* composite type OID, or RECORDOID */
 
    /*
     * datum_typeid cannot be a domain over composite, only plain composite,
     * even if the datum is meant as a value of a domain-over-composite type.
     * This is in line with the general principle that CoerceToDomain does not
     * change the physical representation of the base type value.
     *
     * Note: field ordering is chosen with thought that Oid might someday
     * widen to 64 bits.
     */
} DatumTupleFields;
 
struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    }           t_choice;
 
    ItemPointerData t_ctid;     /* current TID of this or newer tuple (or a
                                 * speculative insertion token) */
 
    /* Fields below here must match MinimalTupleData! */
 
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
    uint16      t_infomask2;    /* number of attributes + various flags */
 
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
    uint16      t_infomask;     /* various flag bits, see below */
 
#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
    uint8       t_hoff;         /* sizeof header incl. bitmap, padding */
 
    /* ^ - 23 bytes - ^ */
 
#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER];  /* bitmap of NULLs */
 
    /* MORE DATA FOLLOWS AT END OF STRUCT */
};

结构体展开，具体如下：

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
//注意：t_cid和t_xvac为联合体，共用存储空间

2.5 Items（tuple）存储剖析

2.5.1 t_xmin

$ hexdump -C $PGDATA/base/5/16384 -s 8152 -n 4
00001fd8  d5 02 00 00                                       |....|
00001fdc
$ echo $((0x000002d5))
725
/**
第一个tuple的偏移为8152，所以从8152开始检索
**/

2.5.2 t_xmax

$ hexdump -C $PGDATA/base/5/16384 -s 8156 -n 4
00001fdc  00 00 00 00                                       |....|
00001fe0

2.5.3 t_cid/t_xvac

$ hexdump -C $PGDATA/base/5/16384 -s 8164 -n 6
00001fe4  00 00 00 00 01 00                                 |......|
00001fea

2.5.4 t_ctid

hexdump -C $PGDATA/base/5/16384 -s 8170 -n 2
00001fea  03 00                                             |..|
00001fec

2.5.5 t_infomask2

$ hexdump -C $PGDATA/base/5/16384 -s 8172 -n 2
00001fdc  00 00 00 00                                       |....|
00001fe0

2.5.5 t_infomask

$ hexdump -C $PGDATA/base/5/16384 -s 8172 -n 2
00001fec  02 09                                             |..|
00001fee

2.5.5 t_hoff

$ hexdump -C $PGDATA/base/5/16384 -s 8174 -n 1
00001fee  18                                                |.|
00001fef
$ echo $((0x18))
24
/**
用户数据开始偏移为24，用户数据偏移为8152+24=8176
**/

2.6 tuple存储剖析

上述是tuple的头部数据，接下来我们看看实际存储的数据。

tuple的总长度（lp_len）为39，计算得到数据大小为：39-24=15

$ hexdump -C $PGDATA/base/5/16384 -s 8176 -n 15
00001ff0  01 00 00 00 13 31 20 20  20 20 20 20 20 05 61     |.....1       .a|
00001fff
/**
表结构：create table t_page (id int,c1 char(8),c2 varchar(16));
第1个字段为int，第2个字段为定长字符，第3个字段为变长字符。
对应的数据为：
id=\x00000001
c1=\x133120202020202020
注：字符串无需高低位切换，\x13是标志位，ASCII码31是字符'1';
c2=\x0561
注：字符串，第1个字节\x05是标志位，ASCII码61是字符'a'
**/

3. pageinspect插件使用

pageinspect 提供了各种函数可用来快速方便的查看page中的内容。

3.1 安装指南

# 进入pg源码安装包解压目录
cd contrib/pageinspect
make && make install

3.2 使用指南

psql -Upostgres
postgres=# create extension pageinspect;


-- 查看page header
postgres=# \x
postgres=# SELECT * FROM page_header(get_raw_page('t_page', 0));
-[ RECORD 1 ]--------
lsn       | 0/1500770
checksum  | 0
flags     | 0
lower     | 40
upper     | 8032
special   | 8192
pagesize  | 8192
version   | 4
prune_xid | 0


-- 查看item
postgres=# select * from heap_page_items(get_raw_page('t_page',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |    725 |      0 |        0 | (0,1)  |           3 |       2306 |     24 |        |       | \x010000001331202020202020200561
  2 |   8112 |        1 |     39 |    726 |      0 |        0 | (0,2)  |           3 |       2306 |     24 |        |       | \x020000001332202020202020200562
  3 |   8072 |        1 |     39 |    727 |      0 |        0 | (0,3)  |           3 |       2306 |     24 |        |       | \x030000001333202020202020200563
  4 |   8032 |        1 |     39 |    728 |      0 |        0 | (0,4)  |           3 |       2306 |     24 |        |       | \x040000001334202020202020200564
(4 rows)
-- 查看page中的raw内容
postgres=# \x
postgres=# select * from get_raw_page('t_new', 0);

4. 文末总结

本文介绍postgresql页存储结构，我们可以观察到，pg为了进行实际的数据查询和mvcc等机制，添加了很多的信息去做记录，实际占用的空间比实际的数据大很多。如果使用列式存储可以有效的压缩空间。

postgresql

文章转载自小陈的技术博客，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

Postgresql page 剖析

评论