b-tree的索引页总览
pg的B-Tree是一种变种,算法详情请参考src/backend/access/nbtree/README,btee的索引页分为以下几种:
meta page->root page #btpo_flags=2->branch page #btpo_flags=0->leaf page #btpo_flags=1注:如果即是leaf又是root则 btpo_flags=3
其中mete Paga和root page是必备,meta page需要一个页来存储,表示指向root page的page id.随着记录的增加,一个root page可能存不下所有的heap items,所以就需要分裂成leaf page,如果继续分裂就存在了brache page和leaf page.
索引的结构图如下:

我们从一步一步从只包含root page,到root page->leaf page,再到root page-branch page->leaf page开始讲解。
研究btree利器pageinspect
create extension pgstatuple create extension pageinspect #查询page, index 详细信息
show how many pages in one table
select pg_relpages(regclass)
show one table tuple information
select * from pgstattuple(regclass)
show one table index information
select * from pgstatindex(regclass)
show one page information
select * from page_header(get_raw_page(relname text, 'main', page number))
show one page all tuples information
select * from heap_page_items(get_raw_page(relname text, 'main', page number))
show one index information
select * from bt_metap(relname text);
show one index page information
select * from bt_page_stats(relname text, page number)
show one index page all tuples information
select * from bt_page_items(relname text, page number)
bt_metap和bt_page_stats字段解释
为了方便后面的理解,我们首先解释一下两个表查询结果的字段解释。
postgres=# select * from bt_metap('tab1_pkey');magic | version | root | level | fastroot | fastlevel--------+---------+------+-------+----------+-----------340322 | 2 | 1 | 0 | 1 | 0root:表示root pagelevel:层级。0表示只有root层;1表示包含leaf层;2表示包含1级的branch和1级leaf,以此类推(存在多级的branch,只有1级的leaf)。
postgres=# select * from bt_page_stats('tab1_pkey',1);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3btno_flags:3:既是root也是leaf2:表示root层0:表示branch层1:表示leaf层btpo:表示当前在第几层。举个例子:如果bt_metap.level=3,表示两层branch和leaf。结构为:branch2(btpo=2)->branch1(btpo=1)->leaf0(btpo=0)btpo_next:下一个指向的Page idbtpo_prev:上一个指向的Page idtype:r:表示rootl:表示leafi:表示branchblkno:表示page no
0层结构(只有root page和meta)
0层结构,只有meta和root页。root页最多可以存储的item数,取决于索引字段数据的长度、以及索引页的大小。

postgres=# create table tbtree(id int primary key, info text);CREATE TABLEpostgres=# insert into tbtree select generate_series(1,100), md5(random()::text);INSERT 0 100postgres=# vacuum analyze tbtree;VACUUM
postgres=# select * from bt_metap('tbtree_pkey');magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------340322 | 4 | 1 | 0 | 1 | 0 | 0 | 100 | t(1 row)level=0:表示只有root层,没有branch和leafroot=1:表示root的page no=1.通过Page no=1继续往下查
postgres=# select * from bt_page_stats('tbtree_pkey',1);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3(1 row)根据page id=1查看结果解释:btpo_flags=3:表示既是root也是leafbtpo=0:表示已经是最后一层既然只有一层,我们根据这一层查看ctid的情况,根据blkno=1的条件查看
postgres=# select * from bt_page_items('tbtree_pkey',1);itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids------------+---------+---------+-------+------+-------------------------+------+---------+------1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) |2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 | f | (0,2) |3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 | f | (0,3) |....100 | (0,100) | 16 | f | f | 64 00 00 00 00 00 00 00 | f | (0,100) |0级ctid 表示存储的是heap页的寻址.如果是多层结构,那么branch page中的ctid, 它表示的是同级btree页(链条项ctid)或者下级btree页的寻址(具体我们后面例子会看到)当ctid指向heap时, data是对应的列值。(多级结构的data意义不一样,后面会讲)
1层结构
包括meta page, root page, leaf page.如果是边界页(branch or leaf),那么其中一个方向没有PAGE,这个方向的链表信息都统一指向meta page。(这里的说明很重要)

postgres=# truncate table tbtree;TRUNCATE TABLEpostgres=# insert into tbtree select generate_series(1,1000), md5(random()::text);INSERT 0 1000postgres=# vacuum analyze tbtree;VACUUM
postgres=# select * from bt_metap('tbtree_pkey');magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------340322 | 4 | 3 | 1 | 3 | 1 | 0 | 1000 | t(1 row)level=1:表示有root层和leafroot=3:表示root的page no=3.通过Page no=3继续往下查
postgres=# select * from bt_page_stats('tbtree_pkey',3);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2(1 row)根据page no=3查看结果解释:btpo_flags=2:表示这个是root层btpo=1:表示第一层(最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址)live_items=3:如果不是最后一层,既leaf层,表示blkno=3下面还有3个page我们继续根据blkno=3去bt_page_items查找
blkno=3:postgres=# select * from bt_page_items('tbtree_pkey',3);itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids------------+-------+---------+-------+------+-------------------------+------+------+------1 | (1,0) | 8 | f | f | | | |2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 | | |3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00 | | |(3 rows)第一条data为空,表示最左边的page,不保存最小值第二条data为ctid=(2,1),2表示pageno=2,1表示偏移量。以此类推。既然这里有page no=1 2 4,那么我们分别查看一看bt_page_stats
blkno=1:postgres=# select * from bt_page_stats('tbtree_pkey',1);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1(1 row)btpo_flags=1 说明是leaf层btpo=0 说明是最后一层了btpo_next=2 说明下一个page no=2btpo_prev 说明上一个page no=0,指向metalive_itmes 说明存在367条记录(最后一层的live_items代表tuple数,其他层的代表page数)blkno=2:postgres=# select * from bt_page_stats('tbtree_pkey',2);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1(1 row)btpo_next=4 说明下一个page no=4,继续查看blkno=4:postgres=# select * from bt_page_stats('tbtree_pkey',4);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1(1 row)btpo_next=0 说明下一个指向metabtpo_prev=2 说明上一个pageno=2
总结:通过这个例子我们可以看出来leaf层的最左(pageno=1)的上一个pageno为meta leaf层的最右(pageno=4)下一个pageno为meta.其余的leaf page都是双向链表。
2层结构
记录数超过1层结构的索引可以存储的记录数时,会分裂为2层结构,除了meta page和root page,还可能包含1层branch page以及1层leaf page。 如果是边界页(branch or leaf),那么其中一个方向没有PAGE,这个方向的链表信息都统一指向meta page。(这里的说明很重要)

create table tab2(id int primary key, info text);postgres=# select 285^2;?column?----------81225postgres=# insert into tab2 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint tab2_pkey do nothing;INSERT 0 951384postgres=# vacuum analyze tab2;VACUUM
postgres=# select * from bt_metap('tab2_pkey');magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------340322 | 4 | 412 | 2 | 412 | 2 | 0 | 951384 | t(1 row)level=2:表示有root、branch、leafroot=412:表示root的page no=421.通过Page no=412继续往下查
root层查看
blkno=412(root层)postgres=# select * from bt_page_stats('tab2_pkey', 412);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2(1 row)根据page no=412查看结果解释:btpo_flags=2:表示这个是root层btpo=1:表示第一层(最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址)live_items=11:如果不是最后一层,既leaf层,表示blkno=11下面还有11个page我们继续根据blkno=412去bt_page_items查找。
branch的page no查询,branch层查看
postgres=# select * from bt_page_items('tab2_pkey', 412);itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids------------+----------+---------+-------+------+-------------------------+------+------+------1 | (3,0) | 8 | f | f | | | |2 | (2490,1) | 16 | f | f | 90 2b 0b 00 00 00 00 00 | | |3 | (1251,1) | 16 | f | f | 29 66 17 00 00 00 00 00 | | |4 | (2498,1) | 16 | f | f | 32 39 25 00 00 00 00 00 | | |5 | (625,1) | 16 | f | f | e1 d6 31 00 00 00 00 00 | | |6 | (2502,1) | 16 | f | f | ce 28 3e 00 00 00 00 00 | | |7 | (1244,1) | 16 | f | f | 43 9b 4b 00 00 00 00 00 | | |8 | (2461,1) | 16 | f | f | 1a 2d 59 00 00 00 00 00 | | |9 | (411,1) | 16 | f | f | fc b8 65 00 00 00 00 00 | | |10 | (1817,1) | 16 | f | f | 3f 09 77 00 00 00 00 00 | | |11 | (1138,1) | 16 | f | f | 1f 69 89 00 00 00 00 00 | | |(11 rows)总共11个page,我们pageno=3可以了
leaf层查看
postgres=# select * from bt_page_stats('tab2_pkey', 3);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------3 | i | 229 | 0 | 15 | 8192 | 3576 | 0 | 2490 | 1 | 0(1 row)btpo_flags=0 表示branchbtpo=1 表示第一层继续查看blkno的Items,通过bt_page_itemspostgres=# select * from bt_page_items('tab2_pkey', 3);itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids------------+----------+---------+-------+------+-------------------------+------+------+------1 | (967,1) | 16 | f | f | 90 2b 0b 00 00 00 00 00 | | |2 | (1,0) | 8 | f | f | | | |3 | (1852,1) | 16 | f | f | 45 0f 00 00 00 00 00 00 | | |4 | (902,1) | 16 | f | f | 76 21 00 00 00 00 00 00 | | |5 | (1921,1) | 16 | f | f | c3 32 00 00 00 00 00 00 | | |...229 | (1898,1) | 16 | f | f | 40 1c 0b 00 00 00 00 00 | | |(229 rows)上述的229条记录代表,branch层page no=3包含229个page页,这些页存相应的leaf查看blkno=1(为什么查看blkno=1,因为最左的索引page的data都为空):postgres=# select * from bt_page_stats('tab2_pkey', 1);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------1 | l | 374 | 0 | 16 | 8192 | 668 | 0 | 1852 | 0 | 1(1 row)btpo_flags=1 表示leafbtpo=0 表示第0层btpo_prev=0表示最左指向metalive_item=374表示对应到heap的记录postgres=# select * from bt_page_items('tab2_pkey', 1);itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids------------+------------+---------+-------+------+-------------------------+------+------------+------1 | (2920,1) | 16 | f | f | 45 0f 00 00 00 00 00 00 | | |2 | (153,41) | 16 | f | f | 00 00 00 00 00 00 00 00 | f | (153,41) |3 | (816,83) | 16 | f | f | 05 00 00 00 00 00 00 00 | f | (816,83) |4 | (5150,8) | 16 | f | f | 07 00 00 00 00 00 00 00 | f | (5150,8) |......如果我们根据索引页结构的原理,能推算出来(153,41)是最小值,取它就没错了。postgres=# select *from tab2 where ctid='(153,41)';id | info----+----------------------------------0 | 04cf1e2ceda0e9a2cc01e0a79331c933(1 row)postgres=# select min(id) from tab2;min-----0
多层结构
多层结构,除了meta page,还可能包含多层branch page,以及一层leaf page。

多层结构其实和上面的2层结构差不多,就是多了branch层。
meta page:postgres=# select * from bt_metap('tab3_pkey');magic | version | root | level | fastroot | fastlevel--------+---------+--------+-------+----------+-----------340322 | 2 | 116816 | 3 | 116816 | 3(1 row)root层:btpo=3表示第3层ppostgres=# select * from bt_page_stats('tab3_pkey', 116816);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags--------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------116816 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 3 | 2(1 row)root层下面的第一个branch:postgres=# select * from bt_page_items('tab2_pkey', 412);postgres=# select * from bt_page_items('tab3_pkey', 116816);itemoffset | ctid | itemlen | nulls | vars | data------------+------------+---------+-------+------+-------------------------1 | (412,1) | 8 | f | f |2 | (116815,1) | 16 | f | f | 5f 9e c5 01 00 00 00 003 | (198327,1) | 16 | f | f | bd 3c 8b 03 00 00 00 00(3 rows)第一个brahch:btpo=2表明是第2层postgres=# select * from bt_page_stats('tab3_pkey', 412);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------412 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 116815 | 2 | 0(1 row)第一个branch下面的branch,既第二个branch:ppostgres=# select * from bt_page_items('tab3_pkey', 412);itemoffset | ctid | itemlen | nulls | vars | data------------+-----------+---------+-------+------+-------------------------1 | (81636,1) | 16 | f | f | 5f 9e c5 01 00 00 00 00 -- 这是指向当前层级右页的ctid2 | (3,1) | 8 | f | f | -- 注意第一条初始值是这3 | (411,1) | 16 | f | f | 77 97 01 00 00 00 00 004 | (698,1) | 16 | f | f | ed 2e 03 00 00 00 00 00...286 | (81350,1) | 16 | f | f | e9 06 c4 01 00 00 00 00(286 rows)第二个branch查看:btpo=1表示第一层postgres=# select * from bt_page_stats('tab3_pkey', 3);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------3 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 411 | 1 | 0(1 row)第二个brach下面的leaf层postgres=# select * from bt_page_items('tab3_pkey', 3);itemoffset | ctid | itemlen | nulls | vars | data------------+---------+---------+-------+------+-------------------------1 | (287,1) | 16 | f | f | 77 97 01 00 00 00 00 002 | (1,1) | 8 | f | f |3 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 004 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00...286 | (286,1) | 16 | f | f | 09 96 01 00 00 00 00 00(286 rows)leaf层:btpo=0表示最后一层了postgres=# select * from bt_page_stats('tab3_pkey', 1);blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1(1 row)
实战表演
postgres=# insert into tbl1 select trunc(random()*10000000), md5(random()::text) from generate_series(1,5000000) on conflict on constraint tbl1_pkey do nothing;INSERT 0 3935181postgres=# vacuum ANALYZE tbl1;VACUUMpostgres=# select * from bt_metap('tbl1_pkey');magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------340322 | 4 | 412 | 2 | 412 | 2 | 0 | 3934710 | t(1 row)
索引结构如下:

命中0条记录
读了3个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page.因为从branch判断没有记录,索引没必要再去读leaf(注:执行计划中的rows=1)
postgres=# select id from tbl1 where id = 1;id----(0 rows)postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;QUERY PLAN----------------------------------------------------------------------------------------------------------------------------Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.041..0.042 rows=0 loops=1)Output: idIndex Cond: (tbl1.id = 1)Heap Fetches: 0Buffers: shared hit=2 read=1 #read=1 物理读Planning:Buffers: shared hit=8Planning Time: 0.224 msExecution Time: 0.098 ms(9 rows)首次有一个物理读 read=1postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;QUERY PLAN----------------------------------------------------------------------------------------------------------------------------Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.035..0.036 rows=0 loops=1)Output: idIndex Cond: (tbl1.id = 1)Heap Fetches: 0Buffers: shared hit=3Planning Time: 0.112 msExecution Time: 0.073 ms(7 rows)第二次3个缓存
命中1条记录
读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page.
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 0;QUERY PLAN----------------------------------------------------------------------------------------------------------------------------Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.45 rows=1 width=4) (actual time=0.120..0.122 rows=1 loops=1)Output: idIndex Cond: (tbl1.id = 0)Heap Fetches: 0Buffers: shared hit=4Planning Time: 0.108 msExecution Time: 0.163 ms(7 rows)
命中4条记录
读了22个INDEX PAGE, 1 meta page + 4* (1 root + 1 branch + 1 leaf) = 13
postgres=# explain (analyze,verbose,timing,costs,buffers) select id,ctid from tbl1 where id in( 0,3897332,4378675,2955292);QUERY PLAN-------------------------------------------------------------------------------------------------------------------------Index Scan using tbl1_pkey on public.tbl1 (cost=0.43..33.79 rows=4 width=10) (actual time=0.036..0.068 rows=4 loops=1)Output: id, ctidIndex Cond: (tbl1.id = ANY ('{0,3897332,4378675,2955292}'::integer[]))Buffers: shared hit=13Planning Time: 0.126 msExecution Time: 0.098 ms(6 rows)
命中两条记录之sql优化
下面两条SQL,一个用In,一个用>= and <=,呈现出不同的状态:第一个1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22 第二个1 meta page +1 root + 1 branch + 1 leaf = 4。说明这两条数据在一个page
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (0,1,2,3,5,6,7);QUERY PLAN-----------------------------------------------------------------------------------------------------------------------------Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..31.13 rows=7 width=4) (actual time=0.035..0.083 rows=1 loops=1)Output: idIndex Cond: (tbl1.id = ANY ('{0,1,2,3,5,6,7}'::integer[]))Heap Fetches: 0Buffers: shared hit=22Planning Time: 0.117 msExecution Time: 0.114 ms(7 rows)postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id >=0 and id<=7;QUERY PLAN----------------------------------------------------------------------------------------------------------------------------Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..4.51 rows=4 width=4) (actual time=0.014..0.016 rows=2 loops=1)Output: idIndex Cond: ((tbl1.id >= 0) AND (tbl1.id <= 7))Heap Fetches: 0Buffers: shared hit=4Planning:Buffers: shared hit=11Planning Time: 0.302 msExecution Time: 0.044 ms(9 rows)
参考:https://developer.aliyun.com/article/53701




