openGauss Daily Practice, Day 21 | Row Storage and Column Storage

Original post by Garen, 2021-12-22

On this last day of the daily check-in, we look at row storage and column storage.

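Row storage means a table is stored on disk row by row; column storage means it is stored column by column. By default, a newly created table uses row storage.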

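Each model has its strengths and weaknesses. Databases serving TP (transaction processing) workloads normally use row storage, which is the default. Column storage is used only for AP (analytical processing) workloads that run complex queries over large volumes of data.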
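
As a minimal sketch of the choice described above, the storage model is selected per table through the ORIENTATION storage parameter. The table names demo_tp_orders and demo_ap_events and their columns are hypothetical, chosen only for illustration, and ORIENTATION = ROW is assumed to behave the same as omitting the clause.

```sql
-- Hypothetical TP-style table: row storage written out explicitly
-- (leaving out the WITH clause should give the same default).
CREATE TABLE demo_tp_orders
(
    order_id    INT,
    customer_id INT,
    amount      NUMBER
)
WITH (ORIENTATION = ROW);

-- Hypothetical AP-style table: column storage.
CREATE TABLE demo_ap_events
(
    event_id   INT,
    event_type VARCHAR2(40),
    amount     NUMBER
)
WITH (ORIENTATION = COLUMN);

-- Remove the illustration tables again.
DROP TABLE demo_tp_orders;
DROP TABLE demo_ap_events;
```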

I have not yet worked on applications with complex queries and large data volumes, and I hope to get the chance soon.

Course Study

1. Create a row-store table

CREATE TABLE test_t1
(
col1 CHAR(2),
col2 VARCHAR2(40),
col3 NUMBER
);

-- The compression attribute is no

\d+ test_t1
insert into test_t1 select col1, col2, col3 from
(select
generate_series(1, 100000) as key,
repeat(chr(int4(random() * 26) + 65), 2) as col1,
repeat(chr(int4(random() * 26) + 65), 30) as col2, 
(random() * (10^4))::integer as col3
);

2. Create a column-store table

CREATE TABLE test_t2
(
col1 CHAR(2),
col2 VARCHAR2(40),
col3 NUMBER
)
WITH (ORIENTATION = COLUMN);

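Compared with the row-store table, the only change is the extra WITH (ORIENTATION = COLUMN) clause at the end of the CREATE TABLE statement.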

-- The compression attribute is low

\d+ test_t2
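
The storage options that \d+ reports can also be read from the system catalog. This is a small sketch, assuming no other schema contains tables named test_t1 or test_t2; it should list the same orientation and compression values for each table.

```sql
-- reloptions holds the storage parameters recorded for each table.
SELECT relname, reloptions
FROM pg_class
WHERE relname IN ('test_t1', 'test_t2')
ORDER BY relname;
```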

-- Insert the same data as in the row-store table

insert into test_t2 select * from test_t1;

3. Compare space usage

\d+

public | test_t1              | table | omm   | 6760 kB    | {orientation=row,compression=no}     
public | test_t2              | table | omm   | 1112 kB    | {orientation=column,compression=low}

The column-store table is clearly smaller: 1112 kB versus 6760 kB for the same 100,000 rows.
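
The same comparison can be made with a query instead of reading the \d+ listing. The sketch below assumes pg_table_size reports a column-store table in the same way as \d+ does; the column aliases are made up for this example.

```sql
-- Sizes of both tables side by side, plus how many times larger
-- the row-store table is than the column-store table.
SELECT pg_size_pretty(pg_table_size('test_t1')) AS row_store_size,
       pg_size_pretty(pg_table_size('test_t2')) AS column_store_size,
       round(pg_table_size('test_t1')::numeric
             / pg_table_size('test_t2'), 2)     AS size_ratio;
```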

4. Compare the speed of reading one column

omm=# analyze VERBOSE test_t1;
INFO:  analyzing "public.test_t1"(gaussdb pid=1)
INFO:  ANALYZE INFO : "test_t1": scanned 841 of 841 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total rows(gaussdb pid=1)
ANALYZE
omm=# analyze VERBOSE test_t2;
INFO:  analyzing "public.test_t2"(gaussdb pid=1)
INFO:  ANALYZE INFO : estimate total rows of "pg_delta_16441": scanned 0 pages of total 0 pages with 1 retry times, containing 0 live rows and 0 dead rows,  estimated 0 total rows(gaussdb pid=1)
INFO:  ANALYZE INFO : "test_t2": scanned 2 of 2 cus, sample 30000 rows, estimated total 100000 rows(gaussdb pid=1)
ANALYZE

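-- The column-store table takes less time than the row-store table (4.344 ms versus 51.951 ms in the runs below)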

omm=# explain analyze select distinct col1 from test_t1;

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2091.00..2091.27 rows=27 width=3) (actual time=51.888..51.892 rows=27 loops=1)
   Group By Key: col1
   ->  Seq Scan on test_t1  (cost=0.00..1841.00 rows=100000 width=3) (actual time=0.011..25.021 rows=100000 loops=1)
 Total runtime: 51.951 ms
(4 rows)

omm=# explain analyze select distinct col1 from test_t2;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Row Adapter  (cost=1008.27..1008.27 rows=27 width=3) (actual time=4.239..4.242 rows=27 loops=1)
   ->  Vector Sonic Hash Aggregate  (cost=1008.00..1008.27 rows=27 width=3) (actual time=4.235..4.236 rows=27 loops=1)
         Group By Key: col1
         ->  CStore Scan on test_t2  (cost=0.00..758.00 rows=100000 width=3) (actual time=0.069..0.337 rows=100000 loops=1)
 Total runtime: 4.344 ms
(5 rows)
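
The test above reads a single column, which is the access pattern column storage is built for. To see how the two tables compare when a query returns whole rows, the same measurement could be repeated as sketched below. The filter value 100 is arbitrary, and no particular outcome is claimed here.

```sql
-- Return every column of the matching rows from each table
-- and compare the reported "Total runtime" values.
EXPLAIN ANALYZE SELECT * FROM test_t1 WHERE col3 = 100;
EXPLAIN ANALYZE SELECT * FROM test_t2 WHERE col3 = 100;
```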

5. Compare the speed of inserting one row

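-- The row-store table takes less time than the column-store table (0.177 ms versus 3.122 ms in the runs below)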

omm=# explain analyze insert into test_t1 values('x', 'xxxx', '123');
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 [Bypass]
 Insert on test_t1  (cost=0.00..0.01 rows=1 width=0) (actual time=0.072..0.073 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 0.177 ms
(4 rows)

omm=# explain analyze insert into test_t2 values('x', 'xxxx', '123');
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Insert on test_t2  (cost=0.00..0.01 rows=1 width=0) (actual time=3.024..3.025 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
 Total runtime: 3.122 ms
(3 rows)
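
Note that EXPLAIN ANALYZE actually executes the statement, so each run above leaves one extra row in its table. If the measurement should not change the data, one option is to wrap it in a transaction and roll it back, as in this sketch:

```sql
-- Time the insert, then undo it so the table still holds
-- the same rows as before the measurement.
START TRANSACTION;
EXPLAIN ANALYZE INSERT INTO test_t1 VALUES ('x', 'xxxx', '123');
ROLLBACK;

START TRANSACTION;
EXPLAIN ANALYZE INSERT INTO test_t2 VALUES ('x', 'xxxx', '123');
ROLLBACK;
```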

6. Clean up

drop table test_t1;
drop table test_t2;

Course Assignment

1. Create a row-store table and a column-store table, and bulk-insert 100,000 rows (the same data in both)

CREATE TABLE test_t3
(
id1 int,
id2 int,
id3 int
);
create table test_t4 (like test_t3) with (orientation = column);

insert into test_t3 select id1, id2, id3 from
(select 
generate_series(1, 100000) as key,
(random() * (10^2))::int as id1,
(random() * (10^3))::int as id2,
(random() * (10^4))::int as id3
);
insert into test_t4 select * from test_t3;

2. Compare the sizes of the row-store and column-store tables

\d+
 public | test_t3              | table | omm   | 4352 kB    | {orientation=row,compression=no}
 public | test_t4              | table | omm   | 536 kB     | {orientation=column,compression=low}

The column-store table is clearly smaller: 536 kB versus 4352 kB.

3. Compare the speed of querying one column and inserting one row

When querying one column, the column-store table takes less time than the row-store table.

omm=# analyze VERBOSE test_t3;
INFO:  analyzing "public.test_t3"(gaussdb pid=1)
INFO:  ANALYZE INFO : "test_t3": scanned 541 of 541 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total rows(gaussdb pid=1)
ANALYZE
omm=# analyze verbose test_t4;
INFO:  analyzing "public.test_t4"(gaussdb pid=1)
INFO:  ANALYZE INFO : estimate total rows of "pg_delta_16460": scanned 0 pages of total 0 pages with 1 retry times, containing 0 live rows and 0 dead rows,  estimated 0 total rows(gaussdb pid=1)
INFO:  ANALYZE INFO : "test_t4": scanned 2 of 2 cus, sample 30000 rows, estimated total 100000 rows(gaussdb pid=1)
ANALYZE


omm=# explain analyze select distinct id1 from test_t3;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1791.00..1792.01 rows=101 width=4) (actual time=46.334..46.350 rows=101 loops=1)
   Group By Key: id1
   ->  Seq Scan on test_t3  (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..24.127 rows=100000 loops=1)
 Total runtime: 46.421 ms
(4 rows)


omm=# explain analyze select distinct id1 from test_t4;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Row Adapter  (cost=525.01..525.01 rows=101 width=4) (actual time=2.632..2.639 rows=101 loops=1)
   ->  Vector Sonic Hash Aggregate  (cost=524.00..525.01 rows=101 width=4) (actual time=2.629..2.629 rows=101 loops=1)
         Group By Key: id1
         ->  CStore Scan on test_t4  (cost=0.00..274.00 rows=100000 width=4) (actual time=0.038..0.417 rows=100000 loops=1)
 Total runtime: 2.740 ms
(5 rows)

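When inserting one row, the row-store table takes less time than the column-store table. The gap is small here (0.175 ms versus 0.226 ms), probably because every column is an int. With the string columns in the earlier test it was much larger (0.177 ms versus 3.122 ms).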

omm=# explain analyze insert into test_t3 values (3, 2, 1);
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 [Bypass]
 Insert on test_t3  (cost=0.00..0.01 rows=1 width=0) (actual time=0.069..0.070 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 0.175 ms
(4 rows)

omm=# explain analyze insert into test_t4 values (3, 2, 1);
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Insert on test_t4  (cost=0.00..0.01 rows=1 width=0) (actual time=0.126..0.127 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 0.226 ms
(3 rows)
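
Single-row inserts are only one write pattern. As a rough sketch, a bulk insert into each table could be timed the same way, using test_t3 as the source for both so the input is identical. Each run is rolled back so the tables keep their contents, and no result is claimed here.

```sql
-- Bulk insert into the row-store table, then undo it.
START TRANSACTION;
EXPLAIN ANALYZE INSERT INTO test_t3 SELECT * FROM test_t3;
ROLLBACK;

-- Bulk insert of the same rows into the column-store table, then undo it.
START TRANSACTION;
EXPLAIN ANALYZE INSERT INTO test_t4 SELECT * FROM test_t3;
ROLLBACK;
```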

4. Clean up

drop table test_t3;
drop table test_t4;
