暂无图片
暂无图片
2
暂无图片
暂无图片
暂无图片

PG 的吐司面包 -- TOAST

原创 大表哥 2022-07-07
1391

image.png

大家好,今天大表哥给大家分享是 PG的 TOAST 的大对象存储方式。

TOAST 是 The Oversized-Attribute Storage Technique 的意思。 粗略的可以翻译成大对象存储技术。

通常PG的 一个数据页的大小是8KB ,并且不允许一个大的元祖跨越多个数据页。官方原文如下:

PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages 。

PG 采取的TOAST 技术就像是 大家平时吃的早餐吐司面包一样,把一个大的面包(元祖)用刀子切成一片一片的. 技术上叫分割成 chunk , 每个chunk 独立形成一行 存储在独立的toast 表中。

什么情况下,可以触发 TOAST 呢?
如果一个元祖的大小,超过 2KB(参数 TOAST_TUPLE_THRESHOLD 决定,这个是编译时候的默认参数,需要重新编译软件才生效),那么PG会尝试进行压缩,把每个元祖控制在2KB之内,如果不能满足2KB之内的需求,
这个时候就需要独立的 toast 的表来存储 oversized 的部分了。

自从PG 11版本之后,可以在建表的语句上进行表级别的修改: 指定参数 toast_tuple_target 来限定 TOAST的阈值

create table toast_table2 (id int, content text) WITH (toast_tuple_target=4096);

通常情况下,在 toast 表中的 chunk 的大小也是 2KB (实际上是1996 bytes)。 换成通俗的来说,就是每个面包片的大小是2KB (实际上是1996 bytes)。

下面我们来试着创建一张带有 toast 的表:

通常我们创建带有 text 类型的表,PG会自动帮我们创建一下先关的TOAST 的表:

可以通过reltoastrelid 这个的对象的oid , 找到具体的 toast 表的名字: pg_toast_33006

dbtest@[local:/tmp]:1992=#5427 create table test_toast(id int, content text); CREATE TABLE dbtest@[local:/tmp]:1992=#5427 select oid, reltoastrelid from pg_class where relname = 'test_toast'; oid | reltoastrelid -------+--------------- 33006 | 33009 (1 row) dbtest@[local:/tmp]:1992=#5427 select relname from pg_class where oid = 33009; relname ---------------- pg_toast_33006 (1 row)

我们来看一下 toast 的表结构: chunk_id, chunk_seq 构成了唯一的主键, chunk_data 存储的是实际的数据,也是我们说的吐司面片的切片

dbtest@[local:/tmp]:1992=#5427 \d pg_toast.pg_toast_33006 TOAST table "pg_toast.pg_toast_33006" Column | Type ------------+--------- chunk_id | oid chunk_seq | integer chunk_data | bytea Owning table: "public.test_toast" Indexes: "pg_toast_33006_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

接下来,我们创建一个函数来帮助我们生成并返回一个大的字符串:

CREATE OR REPLACE FUNCTION generate_random_string( length INTEGER, characters TEXT default '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz') RETURNS TEXT AS$$DECLARE result TEXT := '';BEGIN IF length < 1 then RAISE EXCEPTION 'Invalid length'; END IF; FOR __ IN 1..length LOOP result := result || substr(characters, floor(random() * length(characters))::int + 1, 1); end loop; RETURN result; END;$$ LANGUAGE plpgsql;

我们来测试一下这个函数:

dbtest@[local:/tmp]:1992=#5427 select generate_random_string(20); generate_random_string ------------------------ QB6BJJipFX3LkHoMoHjd (1 row)

我们尝试插入 4KB的字段长度的数据:

dbtest@[local:/tmp]:1992=#5427 insert into test_toast values (1,generate_random_string(1024*4)); INSERT 0 1

查看对应的 toast 表: 我们可以看到分割成了 3个chunk

dbtest@[local:/tmp]:1992=#5427 SELECT chunk_id, COUNT(*) as chunks, pg_size_pretty(sum(octet_length(chunk_data)::bigint))FROM pg_toast.pg_toast_33006 GROUP BY 1 ORDER BY 1; chunk_id | chunks | pg_size_pretty ----------+--------+---------------- 33012 | 3 | 4096 bytes (1 row)

每个chunk 的大小是: 我们可以看到 (seq=0,1)每个 trunk 的大小 是 1996 bytes, 符合小于 参数设定的 TOAST_TUPLE_TARGET 2048 bytes 的上限。

dbtest@[local:/tmp]:1992=#5427 SELECT chunk_id, chunk_seq, pg_size_pretty(sum(octet_length(chunk_data)::bigint))FROM pg_toast.pg_toast_33006 group by 1,2 ORDER BY 1; chunk_id | chunk_seq | pg_size_pretty ----------+-----------+---------------- 33012 | 2 | 104 bytes 33012 | 1 | 1996 bytes 33012 | 0 | 1996 bytes (3 rows)

我们可以看到 切吐司面变的过程就是一刀一刀的来切,每一刀对应着一个 chunk_seg , 核心的内部代码也是这么实现的:

官方文档上提到了 这个源码文件:toast_internals.c

具体的路径是: src/backend/access/common/toast_internals.c 中的这个函数 toast_save_datum

核心切片代码段 : data_todo 就是我们的吐司, chunk_size 就是面包片,循环的一刀一刀的切, 每切一刀, chunk_seq + 1

* Split up the item into chunks */ while (data_todo > 0) { int i; CHECK_FOR_INTERRUPTS(); /* * Calculate the size of this chunk */ chunk_size = Min(TOAST_MAX_CHUNK_SIZE, data_todo); /* * Build a tuple and store it */ t_values[1] = Int32GetDatum(chunk_seq++); SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ); memcpy(VARDATA(&chunk_data), data_p, chunk_size); toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull); heap_insert(toastrel, toasttup, mycid, options, NULL); ... ... /* * Free memory */ heap_freetuple(toasttup); /* * Move on to next chunk */ data_todo -= chunk_size; data_p += chunk_size; }

我们建表的时候 也可以指定 参数 TOAST_TUPLE_TARGET, 来限定触发TOAST的阈值:

取值范围是 128 to 8016

ERROR: value 64 out of bounds for option “toast_tuple_target”
DETAIL: Valid values are between “128” and “8160”.

这个数值越小,触发TOAST的几率就越大,也就是说数据压缩进入 TOAST 表的几率,就越大。

测试创建2张表:

db_test=# CREATE TABLE toast_test_default_threshold (id SERIAL, value TEXT); CREATE TABLE db_test=# CREATE TABLE toast_test_128_threshold (id SERIAL, value TEXT) WITH (toast_tuple_target=128); CREATE TABLE

插入相同的数据:

db_test=# INSERT INTO toast_test_default_threshold (value) VALUES (generate_random_string(2100, '123')); INSERT 0 1 db_test=# INSERT INTO toast_test_128_threshold (value) VALUES (generate_random_string(2100, '123')); INSERT 0 1

查询相关toast 表:

db_test=# SELECT c1.relname, c2.relname AS toast_relname db_test-# FROM pg_class c1 JOIN pg_class c2 ON c1.reltoastrelid = c2.oid db_test-# WHERE c1.relname LIKE '%toast%' AND c1.relkind = 'r'; relname | toast_relname ------------------------------+------------------ toast_test_128_threshold | pg_toast_1844245 toast_test_default_threshold | pg_toast_1844236

我们可以看到 TOAST_TUPLE_TARGET = 128 的表触发了 TOAST:

db_test=# SELECT db_test-# c1.relname, db_test-# pg_size_pretty(pg_relation_size(c1.relname::regclass)) AS size, db_test-# c2.relname AS toast_relname, db_test-# pg_size_pretty(pg_relation_size(('pg_toast.' || c2.relname)::regclass)) AS toast_size db_test-# FROM db_test-# pg_class c1 db_test-# JOIN pg_class c2 ON c1.reltoastrelid = c2.oid db_test-# WHERE db_test-# c1.relname LIKE 'toast_test_%' db_test-# AND c1.relkind = 'r'; relname | size | toast_relname | toast_size ------------------------------+------------+------------------+------------ toast_test_128_threshold | 8192 bytes | pg_toast_1844245 | 8192 bytes toast_test_default_threshold | 8192 bytes | pg_toast_1844236 | 0 bytes db_test=# SELECT chunk_id, chunk_seq, pg_size_pretty(sum(octet_length(chunk_data)::bigint))FROM pg_toast.pg_toast_1844245 group by 1,2 ORDER BY 1; chunk_id | chunk_seq | pg_size_pretty ----------+-----------+---------------- 1844252 | 0 | 759 bytes 1844265 | 0 | 1339 bytes (2 rows) db_test=# SELECT chunk_id, chunk_seq, pg_size_pretty(sum(octet_length(chunk_data)::bigint))FROM pg_toast.pg_toast_1844236 group by 1,2 ORDER BY 1; chunk_id | chunk_seq | pg_size_pretty ----------+-----------+---------------- (0 rows)

结论: TOAST_TUPLE_TARGET 这个数值越小,触发TOAST的几率就越大,也就是说数据压缩进入 TOAST 表的几率,就越大。

我们来看一下TOAST 表 进行全表扫描效率:

我们这次创建2 张表如下:

db_test=# CREATE TABLE toast_test_default_threshold (id SERIAL, value TEXT); CREATE TABLE Time: 6.394 ms db_test=# CREATE TABLE toast_test_4096_threshold (id SERIAL, value TEXT) WITH (toast_tuple_target=4096); CREATE TABLE Time: 6.591 ms

我们向2个表中插入 相同的 50万 的数据:

db_test=# create table tmp_tbl (id SERIAL, value TEXT); CREATE TABLE Time: 6.941 ms db_test=# WITH str AS (SELECT generate_random_string(2048) AS value) db_test-# INSERT INTO tmp_tbl (value) db_test-# SELECT value db_test-# FROM generate_series(1, 500000), str; INSERT 0 500000 Time: 12104.729 ms (00:12.105) db_test=# insert into toast_test_default_threshold select id, value from tmp_tbl; INSERT 0 500000 Time: 13982.846 ms (00:13.983) db_test=# insert into toast_test_4096_threshold select id, value from tmp_tbl; INSERT 0 500000 Time: 6855.358 ms (00:06.855)

查看表的大小:

db_test=# SELECT c1.relname, pg_size_pretty(pg_relation_size(c1.relname::regclass)) AS size, c2.relname AS toast_relname, pg_size_pretty(pg_relation_size(('pg_toast.' || c2.relname)::regclass)) AS toast_size FROM pg_class c1 JOIN pg_class c2 ON c1.reltoastrelid = c2.oid WHERE c1.relname LIKE 'toast_test_%' AND c1.relkind = 'r'; relname | size | toast_relname | toast_size ------------------------------+---------+------------------+------------ toast_test_default_threshold | 25 MB | pg_toast_4844457 | 1302 MB toast_test_4096_threshold | 1302 MB | pg_toast_4844467 | 0 bytes (2 rows)

我们可以测试一下 全表扫描的时间:

db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_default_threshold; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on toast_test_default_threshold (cost=0.00..8185.00 rows=500000 width=22) (actual time=0.020..41.540 rows=500000 loops=1) Planning Time: 0.054 ms Execution Time: 58.474 ms (3 rows) db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_4096_threshold; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on toast_test_4096_threshold (cost=0.00..171667.01 rows=500001 width=2056) (actual time=0.076..539.210 rows=500000 loops=1) Planning Time: 0.059 ms Execution Time: 558.273 ms (3 rows)

我们可以明显看出 在全包扫描的情况下: 带有 toast 存储的表 明显 比 不带toast 存储的表要快很多,

全包扫描耗时比: 大致 1:10

toast_test_default_threshold (带有toast ) VS toast_test_4096_threshold (不带toast ) = 43.163 ms : 473.915 ms

原因也是大家比较容易理解的: toast_test_default_threshold 本身 只有 25MB ,超出 2K 的部分都存储在了 toast的表中
而 toast_test_4096_threshold 这个表 由于我们手动的设置了 toast_tuple_target=4096 , 所以并没触发 toast 存储

relname | size | toast_relname | toast_size ------------------------------+---------+------------------+------------ toast_test_default_threshold | 25 MB | pg_toast_4844457 | 1302 MB toast_test_4096_threshold | 1302 MB | pg_toast_4844467 | 0 bytes

toast_test_default_threshold : value 以指针的形式指向了 toast 表

Image.png

toast_test_4096_threshold: 全部存储在了表内

Image.png

目前我们得出的结论是:全表过滤扫描, 那么优化器是不会选择扫描外部的toast 的表的,只会扫描本表中的指向toast表的指针。

现在即使我们排除大文本字段,只查询 ID 这个int 类型的字段, 2者耗时还是有巨大的差距, 因为优化器的选择的执行计划并没有变还是全表扫描

耗时比 依旧在1:10

db_test=# EXPLAIN (ANALYZE, TIMING) SELECT id FROM toast_test_4096_threshold; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on toast_test_4096_threshold (cost=0.00..171667.01 rows=500001 width=4) (actual time=0.057..557.537 rows=500000 loops=1) Planning Time: 0.028 ms Execution Time: 576.747 ms (3 rows) db_test=# EXPLAIN (ANALYZE, TIMING) SELECT id FROM toast_test_default_threshold; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on toast_test_default_threshold (cost=0.00..8185.00 rows=500000 width=4) (actual time=0.020..52.318 rows=500000 loops=1) Planning Time: 0.056 ms Execution Time: 69.657 ms (3 rows)

这个时候,如果我们创建索引,选择索引access 的方式,只查询一条或者几条的话,差距不是很大。

db_test=# create index concurrently idx_toast_1 on toast_test_4096_threshold (id) ; CREATE INDEX db_test=# create index concurrently idx_toast_2 on toast_test_default_threshold (id) ; CREATE INDEX

性能上基本上 无差别: toast_test_4096_threshold vs toast_test_default_threshold = 0.043 ms : 0.045 ms

db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_4096_threshold where id = 1000; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using idx_toast_1 on toast_test_4096_threshold (cost=0.42..8.44 rows=1 width=2056) (actual time=0.023..0.024 rows=1 loops=1) Index Cond: (id = 1000) Planning Time: 0.077 ms Execution Time: 0.043 ms (4 rows) db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_default_threshold where id = 1000; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_toast_2 on toast_test_default_threshold (cost=0.42..8.44 rows=1 width=22) (actual time=0.019..0.020 rows=1 loops=1) Index Cond: (id = 1000) Planning Time: 0.295 ms Execution Time: 0.045 ms (4 rows)

如果进行大量索引扫描的话,差距还是很明显的 : toast_test_4096_threshold vs toast_test_default_threshold = 145.957 ms: 25.064 ms

db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_default_threshold where id > 1000 and id < 100000; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_toast_2 on toast_test_default_threshold (cost=0.42..3653.04 rows=97581 width=22) (actual time=0.023..20.433 rows=98999 loops=1) Index Cond: ((id > 1000) AND (id < 100000)) Planning Time: 0.182 ms Execution Time: 25.064 ms (4 rows) db_test=# EXPLAIN (ANALYZE, TIMING) SELECT * FROM toast_test_4096_threshold where id > 1000 and id < 100000; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_toast_1 on toast_test_4096_threshold (cost=0.42..36565.48 rows=100353 width=2056) (actual time=0.040..142.147 rows=98999 loops=1) Index Cond: ((id > 1000) AND (id < 100000)) Planning Time: 0.196 ms Execution Time: 145.957 ms (4 rows)

至此我们得出结论:

1)全表扫描的情况下, 无论是否 select 选项里有大字段, toast 的表只会扫描指针, 会比本地存储的表,快很多
2)索引扫描的情况下, 如果索引的选择率高,(high selectivity, high cardinality ), 返回的行少, 那么二者差别不大,
如果索引的选择率低,或者大范围索引扫描的话, 本地存储会比toast 的慢很多。

最后share 几个关于 PG 查询表大小的函数的区别:

pg_table_size: 会得到表的总大小(包含toast表,不包含索引)
pg_relation_size: 只会得到表本身的大小
pg_total_relation_size:会得到表的全部大小(包含toast, index的大小)

SELECT c1.relname, pg_size_pretty(pg_relation_size(c1.relname::regclass)) AS size, c2.relname AS toast_relname, pg_size_pretty(pg_relation_size(('pg_toast.' || c2.relname)::regclass)) AS toast_size, pg_size_pretty(pg_total_relation_size((c1.relname::regclass)::regclass)-pg_table_size((c1.relname::regclass)::regclass)) AS index_total_size FROM pg_class c1 JOIN pg_class c2 ON c1.reltoastrelid = c2.oid WHERE c1.relname LIKE 'toast_test_%' AND c1.relkind = 'r'; relname | size | toast_relname | toast_size | index_total_size ------------------------------+---------+------------------+------------+------------------ toast_test_default_threshold | 25 MB | pg_toast_4844457 | 1302 MB | 11 MB toast_test_4096_threshold | 1302 MB | pg_toast_4844467 | 0 bytes | 11 MB (2 rows)

Have a fun! 😃

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论