暂无图片
暂无图片
3
暂无图片
暂无图片
暂无图片

MogDB/OpenGauss数据库存储结构浅析-relation heappage

1083

1、Page整体布局

MogDB/opengauss基于pg 9.2开发,在存储数据结构上基本上沿用PG的数据结构,并稍微修改。整体布局上这里盗了一张PG的图,进行介绍。

image.png

这张page结构图中少了在page尾部的special space数据区,一般存放与索引相关的特定数据,table relation没有该区域。

整理布局分为4个部分

1、PageHeaderData/HeapPageHeaderData 页面头/heap页面头数据

2、linp  array,行指针数组

3、free space 空闲空间

4、tuple数据,行数据

5、special space,在页面尾部,通过pagehead中的pd_special指向该区域, 一般存放与索引相关的特定数据,table relation page该区域为空,这张图没画出来。

下面逐个介绍

2、HeapPageHeaderData

./src/include/storage/bufpage.h

typedef struct {
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;    /* LSN: next byte after last byte of xlog
                               * record for last change to this page */
    uint16 pd_checksum;       /* checksum */
    uint16 pd_flags;          /* flag bits, see below */
    LocationIndex pd_lower;   /* offset to start of free space */
    LocationIndex pd_upper;   /* offset to end of free space */
    LocationIndex pd_special; /* offset to start of special space */
    uint16 pd_pagesize_version;
    ShortTransactionId pd_prune_xid;           /* oldest prunable XID, or zero if none */
    TransactionId pd_xid_base;                 /* base value for transaction IDs on page */
    TransactionId pd_multi_base;               /* base value for multixact IDs on page */
    ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* beginning of line pointer array */
} HeapPageHeaderData;


disk page organization

space management information generic to any page

	pd_lsn		- identifies xlog record for last change to this page.
	pd_checksum	- page checksum, if set.
	pd_flags	- flag bits.
	pd_lower	- offset to start of free space.
	pd_upper	- offset to end of free space.
	pd_special	- offset to start of special space.
	pd_pagesize_version - size in bytes and page layout version number.
	pd_prune_xid - oldest XID among potentially prunable tuples on page.

pg的PageHeaderData为20字节,openguass的HeapPageHeaderData在此基础上增加了两个8字节数据项,pd_xid_base,pd_multi_base。这也说明opengauss把事物ID从4字节扩到了8字节.

  • 空闲空间起始位置
    pd_lower指向freespace开始,也就是linp array结尾处
    pd_upper执行freespace结束,也就是tuple data开始处

3、linp,行指针

image.png

typedef struct ItemIdData {
    unsigned lp_off : 15, /* offset to tuple (from start of page) */
        lp_flags : 2,     /* state of item pointer, see below */
        lp_len : 15;      /* byte length of tuple */
} ItemIdData;

lp_flags
define LP_UNUSED 0   /* unused (should always have lp_len=0) */
define LP_NORMAL 1   /* used (should always have lp_len>0) */
define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
define LP_DEAD 3     /* dead, may or may not have storage */

  • 每个行指针大小为4字节,也就是32bit.
  • 低15位为lp_off,tuple起始位置在页面内的偏移量
  • 中间2位为lp_flags,标志信息
    LP_UNUSED 0  未使用,lp_len=0
    LP_NORMAL 1 正常状态,lp_len>0
    LP_REDIRECT 2 发生HOT时,执行行新的位置,lp_len=0
    LP_DEAD 3   行为dead
  • lp_len tuple占用空间长度
  • 计算linp数组在heappageheader之后,结束于pd_lower
    计算linp指针的个数,(pd_lower-linp开始位置)/4

4、freespace 空闲空间

页面内的空闲空间

linp从pd_lower处往后增长

tuple data从尾部,也就是pd_upper往前增长

5、tuple data

tuple数据,也是页面最重要的部分,由 tuple header+tuple data组成

tuple header

  • Tuple头部是由23byte固定大小的前缀和可选的NullBitMap构成。
    image.png

image.png

  • t_xmin:代表插入此元组的事务xid;
  • t_xmax:代表更新或者删除此元组的事务xid,如果该元组插入后未进行更新或者删除,t_xmax=0;
  • t_cid:command id,代表在当前事务中,已经执行过多少条sql,例如执行第一条sql时cid=0,执行第二条sql时cid=1;
  • t_ctid:待研究,在 pg中为update后旧版本指向新tuple的指针
  • t_infomask标志位,记录各种信息,如是否存在null列,是否有变长列,是否有OID列等
  • 如果有允许为空的列,则存在null bitmap,可以通过t_infomask判断 t_infomask&0x0001, bitmap的大小与列个数有关。
  • t_hoff 记录 header的大小,包含null bitmap,padding
  • tuple header后会有padding,使tuple header的大小为8的整数倍
typedef struct HeapTupleHeaderData {
    union {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    } t_choice;
    ItemPointerData t_ctid; /* current TID of this or newer tuple */
    /* Fields below here must match MinimalTupleData! */
    uint16 t_infomask2; /* number of attributes + various flags */
    uint16 t_infomask; /* various flag bits, see below */
    int8 t_hoff; /* sizeof header incl. bitmap, padding */
    /* ^ - 23 bytes - ^ */
    bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs -- VARIABLE LENGTH */
    /* MORE DATA FOLLOWS AT END OF STRUCT */
} HeapTupleHeaderData;

typedef struct HeapTupleFields {
    ShortTransactionId t_xmin; /* inserting xact ID */
    ShortTransactionId t_xmax; /* deleting or locking xact ID */
    union {
        CommandId t_cid;           /* inserting or deleting command ID, or both */
        ShortTransactionId t_xvac; /* old-style VACUUM FULL xact ID */
    } t_field3;
} HeapTupleFields;

/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL 0x0001          /* has null attribute(s) */
#define HEAP_HASVARWIDTH 0x0002      /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL 0x0004      /* has external stored attribute(s) */
#define HEAP_HASOID 0x0008           /* has an object-id field */
#define HEAP_COMPRESSED 0x0010       /* has compressed data */
#define HEAP_COMBOCID 0x0020         /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK 0x0040   /* xmax is exclusive locker */
#define HEAP_XMAX_SHARED_LOCK 0x0080 /* xmax is shared locker */
/* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
#define HEAP_IS_LOCKED (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200   /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_INVALID | HEAP_XMIN_COMMITTED)
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800   /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI 0x1000  /* t_xmax is a MultiXactId */
#define HEAP_UPDATED 0x2000        /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF                          \
    0x4000 /* moved to another place by pre-9.0 \
            * VACUUM FULL; kept for binary      \
            * upgrade support */
#define HEAP_MOVED_IN                             \
    0x8000 /* moved from another place by pre-9.0 \
            * VACUUM FULL; kept for binary        \
            * upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK 0xFFE0 /* visibility-related bits */

/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
/* bits 0x1800 are available */
#define HEAP_HAS_REDIS_COLUMNS 0x2000 /* tuple has hidden columns added by redis */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000  /* this is heap-only tuple */

#define HEAP2_XACT_MASK 0xC000 /* visibility-related bits */

tuple  data

1、tuple data 有多个列(属性)组成

2、每个属性的长度分定长与变长

3、每个属性有不同的字节对齐

4、每个属性有不同的存储策略

5、tuple data中不存储null值

6、起始位置8字节对齐

变长类型(typlen = -1)存储策略

  • PLAN 避免压缩和行外存储

  • EXTENDED 允许压缩和行外存储,大多数可以TOAST的数据类型的默认策略

  • EXTERNAL 允许行外存储,但不许压缩

  • MAIN 允许压缩,尽量不使用行外存储

参考 pg_type.typstorage

  • p: Value must always be stored plain.
  • e: Value can be stored in a “secondary” relation (if relation has one, see pg_class.reltoastrelid).
  • m: Value can be stored compressed inline.
  • x: Value can be stored compressed inline or stored in “secondary” storage.

对齐(起始地址需要是某个长度的整数倍)

参考 pg_type.typalign

  • c = char alignment, i.e., no alignment needed.
  • s = short alignment (2 bytes on most machines).
  • i = int alignment (4 bytes on most machines).
  • d = double alignment (8 bytes on many machines, but by no means all).
 typalign | typstorage | count 
----------+------------+-------
 c        | p          |     6
 i        | m          |     3
 d        | x          |   269
 s        | p          |     3
 d        | p          |    17
 i        | p          |    39
 i        | x          |    70
(7 rows)
  

通过pg_type可以查到各个数据类型数据存储长度,对齐大小,存储策略

也可以通过pg_attribute去查表上各列的相关属性

如下面t表示例


create table  t(id int,id2 bigint,c varchar(100),d date,ts timestamp);
insert into t select i,i,'test'||i,now(),now() from generate_series(1,1000)i;

test=# SELECT a.attname,
test-#   pg_catalog.format_type(a.atttypid, a.atttypmod),attlen,
test-#   (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid) for 128)
test(#    FROM pg_catalog.pg_attrdef d
test(#    WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
test-#   a.attnotnull, a.attnum,
test-#   (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t
test(#    WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation,
test-#   a.attidentity,
test-#   NULL AS indexdef,
test-#   NULL AS attfdwoptions,
test-#   a.attstorage,attalign,
test-#   CASE WHEN a.attstattarget=-1 THEN NULL ELSE a.attstattarget END AS attstattarget, pg_catalog.col_description(a.attrelid, a.attnum)
test-# FROM pg_catalog.pg_attribute a
test-# WHERE a.attrelid = 't'::regclass AND a.attnum > 0 AND NOT a.attisdropped
test-# ORDER BY a.attnum;
attname |         format_type         | attlen | substring | attnotnull | attnum | attstorage | attalign 
---------+-----------------------------+--------+-----------+------------+--------+------------+----------
id      | integer                     |      4 |           | f          |      1 | p          | i        
id2     | bigint                      |      8 |           | f          |      2 | p          | d        
c       | character varying(100)      |     -1 |           | f          |      3 | x          | i        
d       | date                        |      4 |           | f          |      4 | p          | i        
ts      | timestamp without time zone |      8 |           | f          |      5 | p          | d        
(5 rows)

test=# select * from heap_page_items(get_raw_page('t',0));
lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                       t_data                                       
-----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+------------------------------------------------------------------------------------
  1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | \x010000000000000001000000000000000d74657374310000e91d000000000000961c1f75bc590200
  2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | \x020000000000000002000000000000000d74657374320000e91d000000000000961c1f75bc590200
  3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | \x030000000000000003000000000000000d74657374330000e91d000000000000961c1f75bc590200
  4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | \x040000000000000004000000000000000d74657374340000e91d000000000000961c1f75bc590200
  5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | \x050000000000000005000000000000000d74657374350000e91d000000000000961c1f75bc590200

                                                                                                                                 0100000000000000 0100000000000000 0d74657374310000 e91d000000000000 961c1f75bc590200
test=# select * from heap_page_item_attrs(get_raw_page('t',0),'t'); 
lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                             t_attrs                                       
     
-----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+-------------------------------------------------------------------------------------------
------
  1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | {"\\x01000000","\\x0100000000000000","\\x0d7465737431","\\xe91d0000","\\x961c1f75bc590200"
}
  2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | {"\\x02000000","\\x0200000000000000","\\x0d7465737432","\\xe91d0000","\\x961c1f75bc590200"
}
  3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | {"\\x03000000","\\x0300000000000000","\\x0d7465737433","\\xe91d0000","\\x961c1f75bc590200"
}
  4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | {"\\x04000000","\\x0400000000000000","\\x0d7465737434","\\xe91d0000","\\x961c1f75bc590200"
}
  5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | {"\\x05000000","\\x0500000000000000","\\x0d7465737435","\\xe91d0000","\\x961c1f75bc590200"
}


变长类型的存储

定长数据类型根据pg_attribute.attlen对齐后直接存储在tupledata中,变长数据类型由于长度不固定,存储时有些特殊。需要在数据中保存长度等信息。
pg_attribute.attlen==-1则为变长类型
1、第一个字节(va_header)=0x01,第二个字节va_tag=18(VARTAG_INDIRECT)
toast机制
1 字节va_header+1字节va_tag+4*4字节varatt_external(toast指针)

typedef struct varatt_external {
    int32 va_rawsize;  /* Original data size (includes header) */
    int32 va_extsize;  /* External saved size (doesn't) */
    Oid va_valueid;    /* Unique ID of value within TOAST table */
    Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external;

2、short varlena,当长度<=126时
判断va_header!=0x01 and va_header&0x01==0x01

第一个字节前7位存储大小,最后一位为0x01
3、full 4-byte header varlena 
(前4字节>>2)&0x3FFFFFFF 数据长度

最后修改时间:2021-11-25 21:47:54
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论