暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

postgres btree索引全解析

开源喵 2021-07-10
869

b-tree的索引页总览


pg的B-Tree是一种变种,算法详情请参考src/backend/access/nbtree/README,btee的索引页分为以下几种:


meta page
->root page #btpo_flags=2
->branch page #btpo_flags=0
->leaf page #btpo_flags=1
注:如果即是leaf又是root btpo_flags=3


其中mete Paga和root page是必备,meta page需要一个页来存储,表示指向root page的page id.随着记录的增加,一个root page可能存不下所有的heap items,所以就需要分裂成leaf page,如果继续分裂就存在了brache page和leaf page.
索引的结构图如下:


我们从一步一步从只包含root page,到root page->leaf page,再到root page-branch page->leaf page开始讲解。



研究btree利器pageinspect


create extension pgstatuple create extension pageinspect #查询page, index 详细信息


show how many pages in one table


select pg_relpages(regclass)


show one table tuple information


select * from pgstattuple(regclass)


show one table index information


select * from pgstatindex(regclass)


show one page information


select * from page_header(get_raw_page(relname text, 'main', page number))


show one page all tuples information


select * from heap_page_items(get_raw_page(relname text, 'main', page number))


show one index information


select * from bt_metap(relname text);


show one index page information


select * from bt_page_stats(relname text, page number)


show one index page all tuples information


select * from bt_page_items(relname text, page number)


bt_metap和bt_page_stats字段解释


为了方便后面的理解,我们首先解释一下两个表查询结果的字段解释。


postgres=# select * from bt_metap('tab1_pkey');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 1 | 0 | 1 | 0
root:表示root page
level:层级。0表示只有root层;1表示包含leaf层;2表示包含1级的branch1leaf,以此类推(存在多级的branch,只有1级的leaf)。


postgres=# select * from bt_page_stats('tab1_pkey',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3
btno_flags:
3:既是root也是leaf
2:表示root
0:表示branch
1:表示leaf
btpo:表示当前在第几层。举个例子:如果bt_metap.level=3,表示两层branchleaf。结构为:branch2(btpo=2)->branch1(btpo=1)->leaf0(btpo=0)
btpo_next:下一个指向的Page id
btpo_prev:上一个指向的Page id
type:
r:表示root
l:表示leaf
i:表示branch
blkno:表示page no


0层结构(只有root page和meta)


0层结构,只有meta和root页。root页最多可以存储的item数,取决于索引字段数据的长度、以及索引页的大小。




postgres=# create table tbtree(id int primary key, info text);
CREATE TABLE
postgres=# insert into tbtree select generate_series(1,100), md5(random()::text);
INSERT 0 100
postgres=# vacuum analyze tbtree;
VACUUM


postgres=# select * from bt_metap('tbtree_pkey');
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------
340322 | 4 | 1 | 0 | 1 | 0 | 0 | 100 | t
(1 row)

level=0:表示只有root层,没有branchleaf
root=1:表示rootpage no=1.通过Page no=1继续往下查


postgres=# select * from bt_page_stats('tbtree_pkey',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3
(1 row)
根据page id=1查看结果解释:
btpo_flags=3:表示既是root也是leaf
btpo=0:表示已经是最后一层
既然只有一层,我们根据这一层查看ctid的情况,根据blkno=1的条件查看


postgres=# select * from bt_page_items('tbtree_pkey',1);
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+---------+---------+-------+------+-------------------------+------+---------+------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) |
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 | f | (0,2) |
3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 | f | (0,3) |
....
100 | (0,100) | 16 | f | f | 64 00 00 00 00 00 00 00 | f | (0,100) |

0ctid 表示存储的是heap页的寻址.
如果是多层结构,那么branch page中的ctid 它表示的是同级btree页(链条项ctid)或者下级btree页的寻址(具体我们后面例子会看到)
ctid指向heap时, data是对应的列值。(多级结构的data意义不一样,后面会讲)


1层结构


包括meta page, root page, leaf page.如果是边界页(branch or leaf),那么其中一个方向没有PAGE,这个方向的链表信息都统一指向meta page。(这里的说明很重要)




postgres=#  truncate table tbtree;
TRUNCATE TABLE
postgres=# insert into tbtree select generate_series(1,1000), md5(random()::text);
INSERT 0 1000
postgres=# vacuum analyze tbtree;
VACUUM


postgres=# select * from bt_metap('tbtree_pkey');
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------
340322 | 4 | 3 | 1 | 3 | 1 | 0 | 1000 | t
(1 row)

level=1:表示有root层和leaf
root=3:表示rootpage no=3.通过Page no=3继续往下查


postgres=# select * from bt_page_stats('tbtree_pkey',3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2
(1 row)

根据page no=3查看结果解释:
btpo_flags=2:表示这个是root
btpo=1:表示第一层(最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址)
live_items=3:如果不是最后一层,既leaf层,表示blkno=3下面还有3page
我们继续根据blkno=3bt_page_items查找


blkno=3:
postgres=# select * from bt_page_items('tbtree_pkey',3);
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+-------+---------+-------+------+-------------------------+------+------+------
1 | (1,0) | 8 | f | f | | | |
2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 | | |
3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00 | | |
(3 rows)

第一条data为空,表示最左边的page,不保存最小值
第二条datactid=(2,1),2表示pageno=2,1表示偏移量。以此类推。
既然这里有page no=1 2 4,那么我们分别查看一看bt_page_stats


blkno=1:
postgres=# select * from bt_page_stats('tbtree_pkey',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
(1 row)

btpo_flags=1 说明是leaf
btpo=0 说明是最后一层了
btpo_next=2 说明下一个page no=2
btpo_prev 说明上一个page no=0,指向meta
live_itmes 说明存在367条记录(最后一层的live_items代表tuple数,其他层的代表page数)

blkno=2:
postgres=# select * from bt_page_stats('tbtree_pkey',2);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1
(1 row)

btpo_next=4 说明下一个page no=4,继续查看

blkno=4:
postgres=# select * from bt_page_stats('tbtree_pkey',4);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1
(1 row)

btpo_next=0 说明下一个指向meta
btpo_prev=2 说明上一个pageno=2


总结:通过这个例子我们可以看出来leaf层的最左(pageno=1)的上一个pageno为meta leaf层的最右(pageno=4)下一个pageno为meta.其余的leaf page都是双向链表。


2层结构


记录数超过1层结构的索引可以存储的记录数时,会分裂为2层结构,除了meta page和root page,还可能包含1层branch page以及1层leaf page。 如果是边界页(branch or leaf),那么其中一个方向没有PAGE,这个方向的链表信息都统一指向meta page。(这里的说明很重要)



create table tab2(id int primary key, info text);  
postgres=# select 285^2;
?column?
----------
81225
postgres=# insert into tab2 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint tab2_pkey do nothing;
INSERT 0 951384
postgres=# vacuum analyze tab2;
VACUUM


postgres=# select * from bt_metap('tab2_pkey');
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------
340322 | 4 | 412 | 2 | 412 | 2 | 0 | 951384 | t
(1 row)

level=2:表示有rootbranchleaf
root=412:表示rootpage no=421.通过Page no=412继续往下查


root层查看


blkno=412(root层)
postgres=# select * from bt_page_stats('tab2_pkey', 412);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2
(1 row)
根据page no=412查看结果解释:
btpo_flags=2:表示这个是root
btpo=1:表示第一层(最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址)
live_items=11:如果不是最后一层,既leaf层,表示blkno=11下面还有11page
我们继续根据blkno=412bt_page_items查找。


branch的page no查询,branch层查看


postgres=# select * from bt_page_items('tab2_pkey', 412);
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+----------+---------+-------+------+-------------------------+------+------+------
1 | (3,0) | 8 | f | f | | | |
2 | (2490,1) | 16 | f | f | 90 2b 0b 00 00 00 00 00 | | |
3 | (1251,1) | 16 | f | f | 29 66 17 00 00 00 00 00 | | |
4 | (2498,1) | 16 | f | f | 32 39 25 00 00 00 00 00 | | |
5 | (625,1) | 16 | f | f | e1 d6 31 00 00 00 00 00 | | |
6 | (2502,1) | 16 | f | f | ce 28 3e 00 00 00 00 00 | | |
7 | (1244,1) | 16 | f | f | 43 9b 4b 00 00 00 00 00 | | |
8 | (2461,1) | 16 | f | f | 1a 2d 59 00 00 00 00 00 | | |
9 | (411,1) | 16 | f | f | fc b8 65 00 00 00 00 00 | | |
10 | (1817,1) | 16 | f | f | 3f 09 77 00 00 00 00 00 | | |
11 | (1138,1) | 16 | f | f | 1f 69 89 00 00 00 00 00 | | |
(11 rows)
总共11page,我们pageno=3可以了


leaf层查看


postgres=# select * from bt_page_stats('tab2_pkey', 3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | i | 229 | 0 | 15 | 8192 | 3576 | 0 | 2490 | 1 | 0
(1 row)

btpo_flags=0 表示branch
btpo=1 表示第一层
继续查看blknoItems,通过bt_page_items

postgres=# select * from bt_page_items('tab2_pkey', 3);
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+----------+---------+-------+------+-------------------------+------+------+------
1 | (967,1) | 16 | f | f | 90 2b 0b 00 00 00 00 00 | | |
2 | (1,0) | 8 | f | f | | | |
3 | (1852,1) | 16 | f | f | 45 0f 00 00 00 00 00 00 | | |
4 | (902,1) | 16 | f | f | 76 21 00 00 00 00 00 00 | | |
5 | (1921,1) | 16 | f | f | c3 32 00 00 00 00 00 00 | | |
...
229 | (1898,1) | 16 | f | f | 40 1c 0b 00 00 00 00 00 | | |
(229 rows)
上述的229条记录代表,branchpage no=3包含229page页,这些页存相应的leaf

查看blkno=1(为什么查看blkno=1,因为最左的索引pagedata都为空):
postgres=# select * from bt_page_stats('tab2_pkey', 1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 374 | 0 | 16 | 8192 | 668 | 0 | 1852 | 0 | 1
(1 row)

btpo_flags=1 表示leaf
btpo=0 表示第0
btpo_prev=0表示最左指向meta
live_item=374表示对应到heap的记录

postgres=# select * from bt_page_items('tab2_pkey', 1);
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+------------+---------+-------+------+-------------------------+------+------------+------
1 | (2920,1) | 16 | f | f | 45 0f 00 00 00 00 00 00 | | |
2 | (153,41) | 16 | f | f | 00 00 00 00 00 00 00 00 | f | (153,41) |
3 | (816,83) | 16 | f | f | 05 00 00 00 00 00 00 00 | f | (816,83) |
4 | (5150,8) | 16 | f | f | 07 00 00 00 00 00 00 00 | f | (5150,8) |
......
如果我们根据索引页结构的原理,能推算出来(153,41)是最小值,取它就没错了。
postgres=# select *from tab2 where ctid='(153,41)';
id | info
----+----------------------------------
0 | 04cf1e2ceda0e9a2cc01e0a79331c933
(1 row)

postgres=# select min(id) from tab2;
min
-----
0


多层结构


多层结构,除了meta page,还可能包含多层branch page,以及一层leaf page。


多层结构其实和上面的2层结构差不多,就是多了branch层。



meta page:
postgres=# select * from bt_metap('tab3_pkey');
magic | version | root | level | fastroot | fastlevel
--------+---------+--------+-------+----------+-----------
340322 | 2 | 116816 | 3 | 116816 | 3
(1 row)

root层:btpo=3表示第3
ppostgres=# select * from bt_page_stats('tab3_pkey', 116816);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
--------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
116816 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 3 | 2
(1 row)


root层下面的第一个branch:
postgres=# select * from bt_page_items('tab2_pkey', 412);
postgres=# select * from bt_page_items('tab3_pkey', 116816);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (412,1) | 8 | f | f |
2 | (116815,1) | 16 | f | f | 5f 9e c5 01 00 00 00 00
3 | (198327,1) | 16 | f | f | bd 3c 8b 03 00 00 00 00
(3 rows)

第一个brahch:btpo=2表明是第2
postgres=# select * from bt_page_stats('tab3_pkey', 412);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
412 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 116815 | 2 | 0
(1 row)

第一个branch下面的branch,既第二个branch:
ppostgres=# select * from bt_page_items('tab3_pkey', 412);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+-------------------------
1 | (81636,1) | 16 | f | f | 5f 9e c5 01 00 00 00 00 -- 这是指向当前层级右页的ctid
2 | (3,1) | 8 | f | f | -- 注意第一条初始值是这
3 | (411,1) | 16 | f | f | 77 97 01 00 00 00 00 00
4 | (698,1) | 16 | f | f | ed 2e 03 00 00 00 00 00
...
286 | (81350,1) | 16 | f | f | e9 06 c4 01 00 00 00 00
(286 rows)

第二个branch查看:btpo=1表示第一层
postgres=# select * from bt_page_stats('tab3_pkey', 3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 411 | 1 | 0
(1 row)

第二个brach下面的leaf
postgres=# select * from bt_page_items('tab3_pkey', 3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (287,1) | 16 | f | f | 77 97 01 00 00 00 00 00
2 | (1,1) | 8 | f | f |
3 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00
4 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00
...
286 | (286,1) | 16 | f | f | 09 96 01 00 00 00 00 00
(286 rows)

leaf层:btpo=0表示最后一层了
postgres=# select * from bt_page_stats('tab3_pkey', 1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
(1 row)


实战表演


postgres=# insert into tbl1 select trunc(random()*10000000), md5(random()::text) from generate_series(1,5000000) on conflict on constraint tbl1_pkey do nothing;
INSERT 0 3935181

postgres=# vacuum ANALYZE tbl1;
VACUUM
postgres=# select * from bt_metap('tbl1_pkey');
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------
340322 | 4 | 412 | 2 | 412 | 2 | 0 | 3934710 | t
(1 row)


索引结构如下:



命中0条记录


读了3个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page.因为从branch判断没有记录,索引没必要再去读leaf(注:执行计划中的rows=1)


postgres=#  select id from tbl1 where id = 1;
id
----
(0 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.041..0.042 rows=0 loops=1)
Output: id
Index Cond: (tbl1.id = 1)
Heap Fetches: 0
Buffers: shared hit=2 read=1 #read=1 物理读
Planning:
Buffers: shared hit=8
Planning Time: 0.224 ms
Execution Time: 0.098 ms
(9 rows)
首次有一个物理读 read=1

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.035..0.036 rows=0 loops=1)
Output: id
Index Cond: (tbl1.id = 1)
Heap Fetches: 0
Buffers: shared hit=3
Planning Time: 0.112 ms
Execution Time: 0.073 ms
(7 rows)
第二次3个缓存


命中1条记录


读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page.


postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.120..0.122 rows=1 loops=1)
Output: id
Index Cond: (tbl1.id = 0)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.108 ms
Execution Time: 0.163 ms
(7 rows)


命中4条记录


读了22个INDEX PAGE, 1 meta page + 4* (1 root + 1 branch + 1 leaf) = 13


postgres=# explain (analyze,verbose,timing,costs,buffers) select id,ctid from tbl1 where id in( 0,3897332,4378675,2955292);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Index Scan using tbl1_pkey on public.tbl1 (cost=0.43..33.79 rows=4 width=10) (actual time=0.036..0.068 rows=4 loops=1)
Output: id, ctid
Index Cond: (tbl1.id = ANY ('{0,3897332,4378675,2955292}'::integer[]))
Buffers: shared hit=13
Planning Time: 0.126 ms
Execution Time: 0.098 ms
(6 rows)


命中两条记录之sql优化


下面两条SQL,一个用In,一个用>= and <=,呈现出不同的状态:第一个1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22 第二个1 meta page +1 root + 1 branch + 1 leaf = 4。说明这两条数据在一个page


postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (0,1,2,3,5,6,7);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..31.13 rows=7 width=4) (actual time=0.035..0.083 rows=1 loops=1)
Output: id
Index Cond: (tbl1.id = ANY ('{0,1,2,3,5,6,7}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=22
Planning Time: 0.117 ms
Execution Time: 0.114 ms
(7 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id >=0 and id<=7;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.51 rows=4 width=4) (actual time=0.014..0.016 rows=2 loops=1)
Output: id
Index Cond: ((tbl1.id >= 0) AND (tbl1.id <= 7))
Heap Fetches: 0
Buffers: shared hit=4
Planning:
Buffers: shared hit=11
Planning Time: 0.302 ms
Execution Time: 0.044 ms
(9 rows)


参考:https://developer.aliyun.com/article/53701


文章转载自开源喵,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论