暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Clickhouse投影功能Projection的使用

叶同学专栏 2022-04-28
2579

clickhouse在21.8以上版本引入了Projection (投影) 功能,投影类似于物化视图,但在part-level上定义。它提供了一致性保证以及查询中的自动使用

  • 投影存储在分区目录中。它类似于索引,但包含存储匿名MergeTree表分区的子目录。这个表是由投影的定义查询生成的

  • 如果存在GROUP BY子句,底层存储引擎变成AggregatingMergeTree,所有聚合函数都转换为AggregateFunction

  • 如果有ORDER BY子句,则MergeTree表将其用作主键表达式。

  • 在合并过程中,投影分区通过其存储的合并例程进行合并

  • 父表分区的校验和与投影分区相结合

特点

ClickHouse Projection 可以看做是一种更加智能的物化视图,它有如下特点:

part-level 存储:

  • Projection 物化的数据就保存在原表的分区目录中,

  • 支持明细数据的普通Projection 和 预聚合Projection

无感使用,自动命中:对一张 MergeTree 可以创建多个 Projection ,当执行 Select 语句的时候,能根据查询范围,自动匹配最优的 Projection 提供查询加速。如果没有命中 Projection , 就直接查询底表。匹配规则如下:

  • Where 必须是 PROJECTION 定义中 GROUP BY 的子集

  • GROUP BY 必须是 PROJECTION 定义中 GROUP BY 的子集

  • SELECT 必须是 PROJECTION 定义中 SELECT 的子集

  • 匹配多个 PROJECTION 的时候,选取读取 part 最少的

  • 返回的数据行小于基表总数

  • 查询覆盖的分区 part 超过一半

  • 数据同源、同生共死:因为物化的数据保存在原表的分区,所以数据的更新、合并都是同源的,也就不会出现不一致的情况了

安装新版本

安装支持projection功能的版本

zookeeper单节点部署

[root@yejf opt]# tar -zxvf apache-zookeeper-3.6.3-bin.tar.gz 
[root@yejf opt]# chown -R ck:ck apache-zookeeper-3.6.3-bin
# vim etc/profile
  export ZK_HOME=/opt/apache-zookeeper-3.6.3-bin
  export PATH=$PATH:$ZK_HOME/bin
[root@yejf ~]$ source etc/profile

[root@yejf ~]$ cp $ZK_HOME/conf/zoo_sample.cfg $ZK_HOME/conf/zoo.cfg
[root@yejf conf]$ vim $ZK_HOME/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181

[root@yejf ~]# zkServer.sh start
ZooKeeper JMX enabled by default
Using config: opt/apache-zookeeper-3.6.3-bin/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@yejf ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: opt/apache-zookeeper-3.6.3-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: standalone
[root@yejf ~]# jps
42356 QuorumPeerMain
42447 Jps

jdk已部署好

[ck@yejf ~]$ java -version
java version "1.8.0_201"

安装ck

会提示输入default的密码

[root@yejf ck]# ll
总用量 1022580
-rw-rw-r-- 1 ck ck     91312 2月 24 09:59 clickhouse-client-21.9.7.2-2.noarch.rpm
-rw-rw-r-- 1 ck ck 187997076 2月 24 10:00 clickhouse-common-static-21.9.7.2-2.x86_64.rpm
-rw-rw-r-- 1 ck ck 858908632 2月 24 10:00 clickhouse-common-static-dbg-21.9.7.2-2.x86_64.rpm
-rw-rw-r-- 1 ck ck   114929 2月 24 10:00 clickhouse-server-21.9.7.2-2.noarch.rpm
[root@yejf ck]# rpm -ivh *
警告:clickhouse-client-21.9.7.2-2.noarch.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中...                         ################################# [100%]
正在升级/安装...
  1:clickhouse-common-static-21.9.7.2################################# [ 25%]
  2:clickhouse-client-21.9.7.2-2     ################################# [ 50%]
  3:clickhouse-server-21.9.7.2-2     ################################# [ 75%]
ClickHouse binary is already located at usr/bin/clickhouse
Symlink usr/bin/clickhouse-server already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-server to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-client already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-client to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-local already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-local to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-benchmark already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-benchmark to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-copier already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-copier to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-obfuscator already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-obfuscator to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-git-import to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-compressor already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-compressor to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-format already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-format to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-extract-from-config already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-extract-from-config to usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
groupadd -r clickhouse
Creating clickhouse user if it does not exist.
useradd -r --shell bin/false --home-dir nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in etc/security/limits.d/clickhouse.conf.
Creating config directory etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Config file etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
/etc/clickhouse-server/config.xml has var/lib/clickhouse/ as data path.
/etc/clickhouse-server/config.xml has var/log/clickhouse-server/ as log path.
Users config file etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
Creating log directory var/log/clickhouse-server/.
Creating data directory var/lib/clickhouse/.
Creating pid directory var/run/clickhouse-server.
chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
chown clickhouse:clickhouse '/var/lib/clickhouse/'
groupadd -r clickhouse-bridge
useradd -r --shell bin/false --home-dir nonexistent -g clickhouse-bridge clickhouse-bridge
chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Enter password for default user:
Password for default user is saved in file etc/clickhouse-server/users.d/default-password.xml.
Setting capabilities for clickhouse binary. This is optional.

ClickHouse has been successfully installed.

Start clickhouse-server with:
sudo clickhouse start

Start clickhouse-client with:
clickhouse-client --password

Created symlink from etc/systemd/system/multi-user.target.wants/clickhouse-server.service to etc/systemd/system/clickhouse-server.service.
  4:clickhouse-common-static-dbg-21.9################################# [100%]
 
#启动
[root@yejf opt]# clickhouse start
chown --recursive clickhouse '/var/run/clickhouse-server/'
Will run su -s bin/sh 'clickhouse' -c '/usr/bin/clickhouse-server --config-file etc/clickhouse-server/config.xml --pid-file var/run/clickhouse-server/clickhouse-server.pid --daemon'
Waiting for server to start
Waiting for server to start
Server started

config.xml文件开放监听

    <listen_host>0.0.0.0</listen_host> <!-- 开放监听 -->

登陆

[root@yejf opt]# clickhouse-client --password ck123
ClickHouse client version 21.9.7.2 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.7 revision 54449.

yejf :)

启动投影功能

修改配置文件user.xml

<?xml version="1.0"?>
<yandex>
  <!-- See also the files in users.d directory where the settings can be overridden. -->

  <!-- Profiles of settings. -->
  <profiles>
      <!-- Default settings. -->
      <default>
          <allow_experimental_projection_optimization>1</allow_experimental_projection_optimization>
          <!-- Maximum memory usage for processing single query, in bytes. -->
          <max_memory_usage>10000000000</max_memory_usage>
          ...

yejf :) select name,value from system.settings where name='allow_experimental_projection_optimization';
Query id: 4cb27612-3c0c-417e-ad3a-c5246f408853
┌─name───────────────────────────────────────┬─value─┐
│ allow_experimental_projection_optimization │ 1     │
└────────────────────────────────────────────┴───────┘

测试数据

https://clickhouse.com/docs/zh/getting-started/example-datasets/metrica/

curl -O https://datasets.clickhouse.com/hits/partitions/hits_v1.tar
tar xvf hits_v1.tar -C var/lib/clickhouse
chown -R clickhouse:clickhouse var/lib/clickhouse/data/datasets
chown -R clickhouse:clickhouse var/lib/clickhouse/metadata/datasets
clickhouse restart
[root@yejf opt]# clickhouse-client --password ck123 --query "SELECT COUNT(*) FROM datasets.hits_v1"

基准测试参考:https://blog.csdn.net/penriver/article/details/122433039

Prejection使用

查看表的排序键

yejf :) show create table datasets.hits_v1;
...
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192

在没有Projection的时候,查询非主键WatchID

yejf :) select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
┌─────────────WatchID─┬─Title────────────────────────────────────────┐
│ 4611686725751467379 │ Трена-2003 ME293 | Скоратов, брюнетки пойдет │
└─────────────────────┴──────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.047 sec. Processed 8.87 million rows, 71.70 MB (187.96 million rows/s., 1.52 GB/s.)

共查询8百万行数据,读取数据71MB,耗时0.047秒

创建Project,按查询需要生成一种有主键的排序规则

ALTER TABLE datasets.hits_v1 ADD PROJECTION p1( 
  SELECT
    WatchID,Title,
  ORDER BY WatchID
);

project只有在创建后,再被写入数据才会自动物化

对于历史数据,需要手动物化

alter table datasets.hits_v1 MATERIALIZE PROJECTION p1;

MATERIALIZE PROJECTION 是一个异步的 Mutation 操作,可以通过下面的语句查询状态

SELECT table,mutation_id,command,is_done,create_time FROM system.mutations order by create_time desc;
┌─table───┬─mutation_id─────┬─command───────────────────┬─is_done─┐
│ hits_v1 │ mutation_19.txt │ DROP PROJECTION p1       │       1 │
│ hits_v1 │ mutation_20.txt │ MATERIALIZE PROJECTION p1 │       1 │
└─────────┴─────────────────┴───────────────────────────┴─────────┘

通过系统表查看prejection的信息

SELECT
  name,
  partition,
  formatReadableSize(bytes_on_disk) AS bytes,
  formatReadableSize(parent_bytes_on_disk) AS parent_bytes,
  parent_rows,
  rows parent_rows AS ratio
FROM system.projection_parts

Query id: b868af4f-b0a6-4380-8db6-0b0cc18637d6

┌─name─┬─partition─┬─bytes──────┬─parent_bytes─┬─parent_rows─┬─ratio─┐
│ p1   │ 201403   │ 440.99 MiB │ 1.61 GiB     │     8873898 │     1 │
└──────┴───────────┴────────────┴──────────────┴─────────────┴───────┘

再次使用WatchID条件查询

yejf :) select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
Query id: 3414cd7d-c627-46c9-9cba-61a7ad60bc40
┌─────────────WatchID─┬─Title────────────────────────────────────────┐
│ 4611686725751467379 │ Трена-2003 ME293 | Скоратов, брюнетки пойдет │
└─────────────────────┴──────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.012 sec. Processed 8.19 thousand rows, 65.62 KB (661.16 thousand rows/s., 5.30 MB/s.)

没有Projection之前,共查询8百万行数据,读取数据71MB,耗时0.047秒,有Projection之后,共查询8千行数据,读取数据65kb,耗时0.012,性能提升数倍,当数量大后性能提升更加明显

从查看表结构中可以看到projection的定义

yejf :) show create table datasets.hits_v1;
...
  PROJECTION p1
  (
      SELECT
          WatchID,
          Title
      ORDER BY WatchID
  )
...

从执行计划可以看到有无使用projection

yejf :) explain  select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
Query id: 5e9b5734-8c72-48e4-ac2a-9e164b0c6123
┌─explain───────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                               │
│   SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│     Limit (preliminary LIMIT (without OFFSET))                           │
│       ReadFromStorage (MergeTree(with 0 projection p1))                   │
└───────────────────────────────────────────────────────────────────────────┘
4 rows in set. Elapsed: 0.010 sec.

删除projection

ALTER TABLE datasets.hits_v1 DROP PROJECTION p1;

Projection本质是空间换时间,可以对比看到在有无Projection前后的数据大小

SELECT
  table AS `表名`,
  sum(rows) AS `总行数`,
  formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`,
  formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`,
  round((sum(data_compressed_bytes) sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率`
FROM system.parts
WHERE table IN ('hits_v1')
GROUP BY table;

#前
┌─表名────┬───总行数─┬─原始大小──┬─压缩大小─┬─压缩率─┐
│ hits_v1 │ 17747796 │ 11.46 GiB │ 2.36 GiB │     21 │
└─────────┴──────────┴───────────┴──────────┴────────┘

#后
┌─表名────┬───总行数─┬─原始大小──┬─压缩大小─┬─压缩率─┐
│ hits_v1 │ 17747796 │ 11.46 GiB │ 2.36 GiB │     21 │
└─────────┴──────────┴───────────┴──────────┴────────┘

创建group by的Projection

ALTER TABLE datasets.hits_v1 ADD PROJECTION agg_p2( 
    SELECT
        UserID,
        SearchPhrase,
        count()
      GROUP BY UserID, SearchPhrase
  );
alter table datasets.hits_v1 MATERIALIZE PROJECTION agg_p2;

在查看计划中看到使用了agg_p2的Projection

yejf :) explain select UserID,count(*) from datasets.hits_v1 group by UserID order by UserID;
┌─explain───────────────────────────────────────────────────────────────────────────┐
│ Expression (Projection)                                                           │
│   MergingSorted (Merge sorted streams for ORDER BY)                               │
│     MergeSorting (Merge sorted blocks for ORDER BY)                               │
│       PartialSorting (Sort each block for ORDER BY)                               │
│         Expression (Before ORDER BY)                                             │
│           SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│             ReadFromStorage (MergeTree(with 1 projection agg_p2))                 │
└───────────────────────────────────────────────────────────────────────────────────┘

投影功能的优化使用

需要优化的sql

SELECT
  months,
  sum(pay_money) AS total_money,
  uniqCombined(order_id_number) AS total_order_count
FROM test.order_detail_local
WHERE years IN (2020) AND (province IN ('广东', '广西', '海南'))
GROUP BY months;

优化前的性能

没有缓存时9秒

有缓存时0.4秒

优化方案1:创建group by投影

ALTER TABLE test.order_detail_local DROP PROJECTION p1;

ALTER TABLE test.order_detail_local ADD PROJECTION p1(
  SELECT years,months,province,sum(pay_money),uniqCombined(order_id_number)
  group BY years,months,province
);
alter table test.order_detail_local MATERIALIZE PROJECTION p1;

测试发现有无缓存时间都差不多是2秒左右,但查询资源少

优化方案2:创建order by 投影

ALTER TABLE test.order_detail_local DROP PROJECTION p1;

ALTER TABLE test.order_detail_local
  ADD PROJECTION p1
  (
      SELECT
          years,
          months,
          province,
          pay_money,
          order_id_number
      ORDER BY (years, months, province)
  );
   
alter table test.order_detail_local MATERIALIZE PROJECTION p1;

投影未创建成功时会报错,可能是当前版本的bug

SELECT
  months,
  sum(pay_money) AS total_money,
  uniqCombined(order_id_number) AS total_order_count
FROM test.order_detail_local
WHERE (years IN (2020)) AND (province IN ('广东', '广西', '海南'))
GROUP BY months

Query id: a46da1a3-16cd-4d63-83ab-a16a87bcc8c4


0 rows in set. Elapsed: 0.044 sec.

Received exception from server (version 21.9.7):
Code: 49. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Block structure mismatch in Pipe::unitePipes stream: different names of columns:
months UInt32 UInt32(size = 0)
order_id_number UInt64 UInt64(size = 0). (LOGICAL_ERROR)

有缓存时查询是0.5秒,没有缓存时查询是1秒多,但查询的资源比group by的投影多

优化方案3:创建物化视图

drop VIEW v_test1;
CREATE MATERIALIZED VIEW v_test1  
ENGINE=MergeTree() ORDER BY (months) POPULATE
AS
SELECT years,months,
  sum(pay_money) as total_money
  ,uniqCombined(order_id_number) as total_order_count
from test.order_detail_local
WHERE (years IN (2020)) AND (province IN ('广东', '广西', '海南'))
group by years,months;

查询性能很快,因为是一张新表了


文章转载自叶同学专栏,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论