作者
digoal
日期
2020-07-02
标签
PostgreSQL , zedstore , 列存储
背景
zedstore行列存储, 基于PG的AM接口开发, 是目前大型PG商业公司pivotal开源的一款存储插件, 源码如下
https://github.com/greenplum-db/postgres/tree/zedstore
设计文档见
https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
https://github.com/greenplum-db/postgres/blob/zedstore/src/backend/access/zedstore/README
为什么zedstore是行列混合存储? 有些不需要统计的列, 可以打包到一起作为行存, 提高kv检索效率。 需要统计的列拆开存储, 缩小统计查询的IO扫描量同时提高压缩比。
zedstore解决了什么业务场景问题?
1、频繁更新, 垃圾回收不及时(IO差的块设备常见问题, 长事务常见问题)导致的索引、HEAP表膨胀, 导致性能下降.
2、32位事务号导致的需要间歇性强制freeze, 在频繁更新的场景会触发较大量IO, 导致性能抖动.
3、OLAP分析场景, 大宽表, 任意列组合透视分析性能问题.
前两个是PG目前heap引擎的通用问题, 第3个是PG作为一个高效能的OLTP+OLAP混合数据库必须解决的问题.
可以说如果zedstore稳定发布, 是PG的一个杀手级存储引擎.
使用例子
```
git clone https://github.com/greenplum-db/postgres/
git branch -a
git checkout -b remotes/origin/zedstore
```
```
useradd digoal
mkdir /data01/digoal
chown digoal:digoal /data01/digoal
```
```
cp -r postgres /home/digoal/
chown -R digoal:digoal /home/digoal/postgres
```
安装lz4, 否则zedstore会很慢
yum install -y lz4-devel
su - digoal
cd postgres
./configure --prefix=/home/digoal/pg_zedstore --with-lz4
make world -j 4
make install-world
```
vi ~/.bash_profile
追加
export PS1="$USER@/bin/hostname -s-> "
export PGPORT=1921
export PGDATA=/data01/digoal/pg_root
export LANG=en_US.utf8
export PGHOME=/home/digoal/pg_zedstore
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=date +"%Y%m%d%H%M"
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
```
initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 300
unix_socket_directories = '/tmp, .'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 8GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
synchronous_commit = off
wal_compression = on
wal_writer_delay = 10ms
max_wal_size = 24GB
min_wal_size = 8GB
random_page_cost = 1.1
log_destination = 'csvlog'
logging_collector = on
log_checkpoints = on
log_timezone = 'Asia/Shanghai'
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_table_age = 1150000000
vacuum_multixact_freeze_table_age = 1150000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
pg_ctl start
测试数据
postgres=# do language plpgsql $$
declare
sql text;
begin
sql := 'create table t_heap (id int8,';
for i in 1..10 loop
sql := sql||'int'||i||' int,';
end loop;
for i in 1..10 loop
sql := sql||'text'||i||' text,';
end loop;
sql := sql||'crt_time timestamp) using heap';
execute sql;
end;
$$;
postgres=# do language plpgsql $$
declare
sql text;
begin
sql := 'create table t_zedstore (id int8,';
for i in 1..10 loop
sql := sql||'int'||i||' int,';
end loop;
for i in 1..10 loop
sql := sql||'text'||i||' text,';
end loop;
sql := sql||'crt_time timestamp) using zedstore';
execute sql;
end;
$$;
```
postgres=# \d+ t_heap
Table "public.t_heap"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | bigint | | | | plain | |
int1 | integer | | | | plain | |
int2 | integer | | | | plain | |
int3 | integer | | | | plain | |
int4 | integer | | | | plain | |
int5 | integer | | | | plain | |
int6 | integer | | | | plain | |
int7 | integer | | | | plain | |
int8 | integer | | | | plain | |
int9 | integer | | | | plain | |
int10 | integer | | | | plain | |
text1 | text | | | | extended | |
text2 | text | | | | extended | |
text3 | text | | | | extended | |
text4 | text | | | | extended | |
text5 | text | | | | extended | |
text6 | text | | | | extended | |
text7 | text | | | | extended | |
text8 | text | | | | extended | |
text9 | text | | | | extended | |
text10 | text | | | | extended | |
crt_time | timestamp without time zone | | | | plain | |
Access method: heap
postgres=# \d+ t_zedstore
Table "public.t_zedstore"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | bigint | | | | plain | |
int1 | integer | | | | plain | |
int2 | integer | | | | plain | |
int3 | integer | | | | plain | |
int4 | integer | | | | plain | |
int5 | integer | | | | plain | |
int6 | integer | | | | plain | |
int7 | integer | | | | plain | |
int8 | integer | | | | plain | |
int9 | integer | | | | plain | |
int10 | integer | | | | plain | |
text1 | text | | | | extended | |
text2 | text | | | | extended | |
text3 | text | | | | extended | |
text4 | text | | | | extended | |
text5 | text | | | | extended | |
text6 | text | | | | extended | |
text7 | text | | | | extended | |
text8 | text | | | | extended | |
text9 | text | | | | extended | |
text10 | text | | | | extended | |
crt_time | timestamp without time zone | | | | plain | |
Access method: zedstore
```
```
\timing
insert into t_heap
select
id,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()100)::int::text),
md5((random()100)::int::text),
clock_timestamp()
from
generate_series(1,10000000) t(id);
INSERT 0 10000000
Time: 55137.382 ms (00:55.137)
```
```
checkpoint;
insert into t_zedstore
select
id,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()100)::int::text),
md5((random()100)::int::text),
clock_timestamp()
from
generate_series(1,10000000) t(id);
INSERT 0 10000000
Time: 156857.767 ms (02:36.858)
```
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | t_heap | table | postgres | permanent | 4113 MB |
public | t_zedstore | table | postgres | permanent | 1941 MB |
(2 rows)
统计分析测试:
```
postgres=# explain select int1,count(*) from t_heap group by 1;
QUERY PLAN
HashAggregate (cost=676315.01..676315.12 rows=11 width=12)
Group Key: int1
-> Seq Scan on t_heap (cost=0.00..626315.34 rows=9999934 width=4)
(3 rows)
Time: 0.405 ms
postgres=# explain select int1,count(*) from t_zedstore group by 1;
QUERY PLAN
HashAggregate (cost=151860.92..151861.03 rows=11 width=12)
Group Key: int1
-> Seq Scan on t_zedstore (cost=0.00..101861.95 rows=9999795 width=4)
(3 rows)
Time: 0.155 ms
```
```
postgres=# select int1,count(*) from t_zedstore group by 1;
int1 | count
------+---------
5 | 999956
4 | 1000395
10 | 501530
6 | 998901
2 | 999192
7 | 1000936
1 | 999489
8 | 998940
9 | 1000325
3 | 1000801
0 | 499535
(11 rows)
Time: 1343.633 ms (00:01.344)
postgres=# select text1,count(*) from t_zedstore group by 1;
text1 | count
----------------------------------+---------
c4ca4238a0b923820dcc509a6f75849b | 1001059
8f14e45fceea167a5a36dedd4bea2543 | 999705
d3d9446802a44259755d38e6d163e820 | 499813
1679091c5a880faf6fb5e6087eb1b2dc | 1001648
e4da3b7fbbce2345d7772b0674a318d5 | 999392
eccbc87e4b5ce2fe28308fd9f2a7baf3 | 999374
cfcd208495d565ef66e7dff9f98764da | 501183
c9f0f895fb98ab9159f51fd0297e236d | 1001479
45c48cce2e2d7fbdea1afc51c7c6ad26 | 998835
a87ff679a2f3e71d9181a67b7542122c | 999037
c81e728d9d4c2f636f067f89cc14862c | 998475
(11 rows)
Time: 1863.111 ms (00:01.863)
```
```
postgres=# select int1,count(*) from t_heap group by 1;
int1 | count
------+---------
5 | 1000662
4 | 999872
10 | 499901
6 | 1000942
2 | 1000378
7 | 999201
1 | 1001494
8 | 998922
9 | 998764
3 | 999863
0 | 500001
(11 rows)
Time: 2655.144 ms (00:02.655)
postgres=# select text1,count(*) from t_heap group by 1;
text1 | count
----------------------------------+---------
c4ca4238a0b923820dcc509a6f75849b | 999925
8f14e45fceea167a5a36dedd4bea2543 | 1000277
d3d9446802a44259755d38e6d163e820 | 497869
1679091c5a880faf6fb5e6087eb1b2dc | 1000732
e4da3b7fbbce2345d7772b0674a318d5 | 999516
eccbc87e4b5ce2fe28308fd9f2a7baf3 | 1002371
cfcd208495d565ef66e7dff9f98764da | 500511
c9f0f895fb98ab9159f51fd0297e236d | 996678
45c48cce2e2d7fbdea1afc51c7c6ad26 | 1000367
a87ff679a2f3e71d9181a67b7542122c | 1001158
c81e728d9d4c2f636f067f89cc14862c | 1000596
(11 rows)
Time: 3062.112 ms (00:03.062)
```
```
alter table t_heap add constraint pk primary key(id);
alter table t_zedstore add constraint pk1 primary key(id);
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+--------+-------------
public | pk | index | postgres | t_heap | permanent | 214 MB |
public | pk1 | index | postgres | t_zedstore | permanent | 214 MB |
(2 rows)
```
大宽表明细KV查询测试:
```
vi test1.sql
\set id random(1,10000000)
select * from t_heap where id=:id;
select * from t_zedstore where id=:id;
```
```
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 52 -j 52 -T 120
transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 6932745
latency average = 0.900 ms
latency stddev = 0.176 ms
tps = 57772.190636 (including connections establishing)
tps = 57774.347762 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set id random(1,10000000)
0.061 select * from t_heap where id=:id;
0.838 select * from t_zedstore where id=:id;
```
高并发更新测试:
```
vi test2.sql
\set id random(1,100000000)
insert into t_heap values (:id,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()100)::int::text),
md5((random()100)::int::text),
clock_timestamp()
) on conflict (id)
do update set
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;
insert into t_zedstore values (:id,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
random()10,
random()100,
random()1000,
random()10000,
random()100000,
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()10)::int::text),
md5((random()100)::int::text),
md5((random()1000)::int::text),
md5((random()10000)::int::text),
md5((random()100)::int::text),
md5((random()100)::int::text),
clock_timestamp()
) on conflict (id)
do update set
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;
```
```
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 52 -j 52 -T 120
progress: 1.0 s, 1096.0 tps, lat 6.098 ms stddev 2.041
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 5.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 6.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 9.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
postgres=# select wait_event from pg_stat_activity ;
wait_event
AutoVacuumMain
LogicalLauncherMain
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
buffer_content
page
page
buffer_content
buffer_content
page
buffer_content
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
BgWriterHibernate
CheckpointerMain
WalWriterMain
```
大量冲突, 改成do nothing还是会导致大量冲突, 目前zedstore的高并发写入存在问题.
insert into t_zedstore values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id)
do nothing;
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 52 -j 52 -T 120
progress: 1.0 s, 9536.4 tps, lat 5.403 ms stddev 2.373
progress: 2.0 s, 9579.0 tps, lat 5.421 ms stddev 2.730
progress: 3.0 s, 9886.0 tps, lat 5.265 ms stddev 2.483
progress: 4.0 s, 10011.0 tps, lat 5.201 ms stddev 2.710
progress: 5.0 s, 10345.0 tps, lat 5.026 ms stddev 2.434
progress: 6.0 s, 9983.0 tps, lat 5.188 ms stddev 2.554
progress: 7.0 s, 9892.0 tps, lat 5.277 ms stddev 2.674
progress: 8.0 s, 10375.0 tps, lat 5.011 ms stddev 2.390
progress: 9.0 s, 10059.0 tps, lat 5.165 ms stddev 2.678
progress: 10.0 s, 10123.0 tps, lat 5.135 ms stddev 2.470
progress: 11.0 s, 5129.5 tps, lat 4.941 ms stddev 2.311
progress: 12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 16.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 17.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 18.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
等待page lock
```
[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41065
0 0x00007f5441b9ae43 in __epoll_wait_nocancel () from /lib64/libc.so.6
1 0x000000000075888e in WaitEventSetWait ()
2 0x0000000000758ce9 in WaitLatchOrSocket ()
3 0x00000000007701df in ProcSleep ()
4 0x0000000000764cef in WaitOnLock ()
5 0x0000000000766076 in LockAcquireExtended ()
6 0x0000000000763f46 in LockPage ()
7 0x000000000052d0e3 in zsundo_trim ()
8 0x000000000052e32e in zsundo_get_oldest_undo_ptr ()
9 0x0000000000520b95 in zsbt_tid_begin_scan ()
10 0x000000000052a54f in zedstoream_fetch_row ()
11 0x000000000052a9c1 in zedstoream_index_fetch_tuple ()
12 0x00000000004d29b2 in index_fetch_heap ()
13 0x00000000004d2a1b in index_getnext_slot ()
14 0x000000000062cd2b in check_exclusion_or_unique_constraint ()
15 0x000000000062d73e in ExecCheckIndexConstraints ()
16 0x0000000000653de6 in ExecInsert ()
17 0x0000000000655219 in ExecModifyTable ()
18 0x000000000062e092 in standard_ExecutorRun ()
19 0x000000000077c02a in ProcessQuery ()
20 0x000000000077c258 in PortalRunMulti ()
21 0x000000000077cc6d in PortalRun ()
22 0x000000000077a902 in PostgresMain ()
23 0x0000000000482278 in ServerLoop ()
24 0x0000000000709d63 in PostmasterMain ()
25 0x0000000000482ebe in main ()
```
等待buffer_content lock
```
[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41063
0 0x00007f544279aafb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
1 0x00007f544279ab8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
2 0x00007f544279ac2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
3 0x00000000006f7f52 in PGSemaphoreLock ()
4 0x0000000000769a5c in LWLockAcquire ()
5 0x000000000051c8c3 in zsbt_descend ()
6 0x0000000000521143 in zsbt_tid_multi_insert ()
7 0x0000000000528010 in zedstoream_insert_speculative ()
8 0x000000000065417f in ExecInsert ()
9 0x0000000000655219 in ExecModifyTable ()
10 0x000000000062e092 in standard_ExecutorRun ()
11 0x000000000077c02a in ProcessQuery ()
12 0x000000000077c258 in PortalRunMulti ()
13 0x000000000077cc6d in PortalRun ()
14 0x000000000077a902 in PostgresMain ()
15 0x0000000000482278 in ServerLoop ()
16 0x0000000000709d63 in PostmasterMain ()
17 0x0000000000482ebe in main ()
```
性能对比
引擎 | 存储空间 | OLTP KV查询 | OLAP 列级聚合统计
---|--- | --- | ---
heap | 1千万 4GB | 0.061 毫秒 | 2.6 秒
zedstore | 1千万 2GB | 0.838 毫秒 | 1.3 秒
目前zedstore的并发写入还存在问题, 查询没有问题.
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.
9.9元购买3个月阿里云RDS PostgreSQL实例
PostgreSQL 解决方案集合
德哥 / digoal's github - 公益是一辈子的事.





