clickhouse在21.8以上版本引入了Projection (投影) 功能,投影类似于物化视图,但在part-level上定义。它提供了一致性保证以及查询中的自动使用
投影存储在分区目录中。它类似于索引,但包含存储匿名MergeTree表分区的子目录。这个表是由投影的定义查询生成的
如果存在GROUP BY子句,底层存储引擎变成AggregatingMergeTree,所有聚合函数都转换为AggregateFunction
如果有ORDER BY子句,则MergeTree表将其用作主键表达式。
在合并过程中,投影分区通过其存储的合并例程进行合并
父表分区的校验和与投影分区相结合
特点
ClickHouse Projection 可以看做是一种更加智能的物化视图,它有如下特点:
part-level 存储:
Projection 物化的数据就保存在原表的分区目录中,
支持明细数据的普通Projection 和 预聚合Projection
无感使用,自动命中:对一张 MergeTree 可以创建多个 Projection ,当执行 Select 语句的时候,能根据查询范围,自动匹配最优的 Projection 提供查询加速。如果没有命中 Projection , 就直接查询底表。匹配规则如下:
Where 必须是 PROJECTION 定义中 GROUP BY 的子集
GROUP BY 必须是 PROJECTION 定义中 GROUP BY 的子集
SELECT 必须是 PROJECTION 定义中 SELECT 的子集
匹配多个 PROJECTION 的时候,选取读取 part 最少的
返回的数据行小于基表总数
查询覆盖的分区 part 超过一半
数据同源、同生共死:因为物化的数据保存在原表的分区,所以数据的更新、合并都是同源的,也就不会出现不一致的情况了
安装新版本
安装支持projection功能的版本
zookeeper单节点部署
[root@yejf opt]# tar -zxvf apache-zookeeper-3.6.3-bin.tar.gz
[root@yejf opt]# chown -R ck:ck apache-zookeeper-3.6.3-bin
# vim etc/profile
export ZK_HOME=/opt/apache-zookeeper-3.6.3-bin
export PATH=$PATH:$ZK_HOME/bin
[root@yejf ~]$ source etc/profile
[root@yejf ~]$ cp $ZK_HOME/conf/zoo_sample.cfg $ZK_HOME/conf/zoo.cfg
[root@yejf conf]$ vim $ZK_HOME/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
[root@yejf ~]# zkServer.sh start
ZooKeeper JMX enabled by default
Using config: opt/apache-zookeeper-3.6.3-bin/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@yejf ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: opt/apache-zookeeper-3.6.3-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: standalone
[root@yejf ~]# jps
42356 QuorumPeerMain
42447 Jps
jdk已部署好
[ck@yejf ~]$ java -version
java version "1.8.0_201"
安装ck
会提示输入default的密码
[root@yejf ck]# ll
总用量 1022580
-rw-rw-r-- 1 ck ck 91312 2月 24 09:59 clickhouse-client-21.9.7.2-2.noarch.rpm
-rw-rw-r-- 1 ck ck 187997076 2月 24 10:00 clickhouse-common-static-21.9.7.2-2.x86_64.rpm
-rw-rw-r-- 1 ck ck 858908632 2月 24 10:00 clickhouse-common-static-dbg-21.9.7.2-2.x86_64.rpm
-rw-rw-r-- 1 ck ck 114929 2月 24 10:00 clickhouse-server-21.9.7.2-2.noarch.rpm
[root@yejf ck]# rpm -ivh *
警告:clickhouse-client-21.9.7.2-2.noarch.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中... ################################# [100%]
正在升级/安装...
1:clickhouse-common-static-21.9.7.2################################# [ 25%]
2:clickhouse-client-21.9.7.2-2 ################################# [ 50%]
3:clickhouse-server-21.9.7.2-2 ################################# [ 75%]
ClickHouse binary is already located at usr/bin/clickhouse
Symlink usr/bin/clickhouse-server already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-server to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-client already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-client to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-local already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-local to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-benchmark already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-benchmark to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-copier already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-copier to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-obfuscator already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-obfuscator to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-git-import to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-compressor already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-compressor to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-format already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-format to usr/bin/clickhouse.
Symlink usr/bin/clickhouse-extract-from-config already exists but it points to clickhouse. Will replace the old symlink to usr/bin/clickhouse.
Creating symlink usr/bin/clickhouse-extract-from-config to usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
groupadd -r clickhouse
Creating clickhouse user if it does not exist.
useradd -r --shell bin/false --home-dir nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in etc/security/limits.d/clickhouse.conf.
Creating config directory etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Config file etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
/etc/clickhouse-server/config.xml has var/lib/clickhouse/ as data path.
/etc/clickhouse-server/config.xml has var/log/clickhouse-server/ as log path.
Users config file etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
Creating log directory var/log/clickhouse-server/.
Creating data directory var/lib/clickhouse/.
Creating pid directory var/run/clickhouse-server.
chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
chown clickhouse:clickhouse '/var/lib/clickhouse/'
groupadd -r clickhouse-bridge
useradd -r --shell bin/false --home-dir nonexistent -g clickhouse-bridge clickhouse-bridge
chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Enter password for default user:
Password for default user is saved in file etc/clickhouse-server/users.d/default-password.xml.
Setting capabilities for clickhouse binary. This is optional.
ClickHouse has been successfully installed.
Start clickhouse-server with:
sudo clickhouse start
Start clickhouse-client with:
clickhouse-client --password
Created symlink from etc/systemd/system/multi-user.target.wants/clickhouse-server.service to etc/systemd/system/clickhouse-server.service.
4:clickhouse-common-static-dbg-21.9################################# [100%]
#启动
[root@yejf opt]# clickhouse start
chown --recursive clickhouse '/var/run/clickhouse-server/'
Will run su -s bin/sh 'clickhouse' -c '/usr/bin/clickhouse-server --config-file etc/clickhouse-server/config.xml --pid-file var/run/clickhouse-server/clickhouse-server.pid --daemon'
Waiting for server to start
Waiting for server to start
Server started
config.xml文件开放监听
<listen_host>0.0.0.0</listen_host> <!-- 开放监听 -->登陆
[root@yejf opt]# clickhouse-client --password ck123
ClickHouse client version 21.9.7.2 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.7 revision 54449.
yejf :)
启动投影功能
修改配置文件user.xml
<?xml version="1.0"?>
<yandex>
<!-- See also the files in users.d directory where the settings can be overridden. -->
<!-- Profiles of settings. -->
<profiles>
<!-- Default settings. -->
<default>
<allow_experimental_projection_optimization>1</allow_experimental_projection_optimization>
<!-- Maximum memory usage for processing single query, in bytes. -->
<max_memory_usage>10000000000</max_memory_usage>
...
yejf :) select name,value from system.settings where name='allow_experimental_projection_optimization';
Query id: 4cb27612-3c0c-417e-ad3a-c5246f408853
┌─name───────────────────────────────────────┬─value─┐
│ allow_experimental_projection_optimization │ 1 │
└────────────────────────────────────────────┴───────┘
测试数据
https://clickhouse.com/docs/zh/getting-started/example-datasets/metrica/
curl -O https://datasets.clickhouse.com/hits/partitions/hits_v1.tar
tar xvf hits_v1.tar -C var/lib/clickhouse
chown -R clickhouse:clickhouse var/lib/clickhouse/data/datasets
chown -R clickhouse:clickhouse var/lib/clickhouse/metadata/datasets
clickhouse restart
[root@yejf opt]# clickhouse-client --password ck123 --query "SELECT COUNT(*) FROM datasets.hits_v1"
基准测试参考:https://blog.csdn.net/penriver/article/details/122433039
Prejection使用
查看表的排序键
yejf :) show create table datasets.hits_v1;
...
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
在没有Projection的时候,查询非主键WatchID
yejf :) select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
┌─────────────WatchID─┬─Title────────────────────────────────────────┐
│ 4611686725751467379 │ Трена-2003 ME293 | Скоратов, брюнетки пойдет │
└─────────────────────┴──────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.047 sec. Processed 8.87 million rows, 71.70 MB (187.96 million rows/s., 1.52 GB/s.)
共查询8百万行数据,读取数据71MB,耗时0.047秒
创建Project,按查询需要生成一种有主键的排序规则
ALTER TABLE datasets.hits_v1 ADD PROJECTION p1(
SELECT
WatchID,Title,
ORDER BY WatchID
);
project只有在创建后,再被写入数据才会自动物化
对于历史数据,需要手动物化
alter table datasets.hits_v1 MATERIALIZE PROJECTION p1;MATERIALIZE PROJECTION 是一个异步的 Mutation 操作,可以通过下面的语句查询状态
SELECT table,mutation_id,command,is_done,create_time FROM system.mutations order by create_time desc;
┌─table───┬─mutation_id─────┬─command───────────────────┬─is_done─┐
│ hits_v1 │ mutation_19.txt │ DROP PROJECTION p1 │ 1 │
│ hits_v1 │ mutation_20.txt │ MATERIALIZE PROJECTION p1 │ 1 │
└─────────┴─────────────────┴───────────────────────────┴─────────┘
通过系统表查看prejection的信息
SELECT
name,
partition,
formatReadableSize(bytes_on_disk) AS bytes,
formatReadableSize(parent_bytes_on_disk) AS parent_bytes,
parent_rows,
rows parent_rows AS ratio
FROM system.projection_parts
Query id: b868af4f-b0a6-4380-8db6-0b0cc18637d6
┌─name─┬─partition─┬─bytes──────┬─parent_bytes─┬─parent_rows─┬─ratio─┐
│ p1 │ 201403 │ 440.99 MiB │ 1.61 GiB │ 8873898 │ 1 │
└──────┴───────────┴────────────┴──────────────┴─────────────┴───────┘
再次使用WatchID条件查询
yejf :) select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
Query id: 3414cd7d-c627-46c9-9cba-61a7ad60bc40
┌─────────────WatchID─┬─Title────────────────────────────────────────┐
│ 4611686725751467379 │ Трена-2003 ME293 | Скоратов, брюнетки пойдет │
└─────────────────────┴──────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.012 sec. Processed 8.19 thousand rows, 65.62 KB (661.16 thousand rows/s., 5.30 MB/s.)
没有Projection之前,共查询8百万行数据,读取数据71MB,耗时0.047秒,有Projection之后,共查询8千行数据,读取数据65kb,耗时0.012,性能提升数倍,当数量大后性能提升更加明显
从查看表结构中可以看到projection的定义
yejf :) show create table datasets.hits_v1;
...
PROJECTION p1
(
SELECT
WatchID,
Title
ORDER BY WatchID
)
...
从执行计划可以看到有无使用projection
yejf :) explain select WatchID,Title from datasets.hits_v1 where WatchID=4611686725751467379 limit 10;
Query id: 5e9b5734-8c72-48e4-ac2a-9e164b0c6123
┌─explain───────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│ Limit (preliminary LIMIT (without OFFSET)) │
│ ReadFromStorage (MergeTree(with 0 projection p1)) │
└───────────────────────────────────────────────────────────────────────────┘
4 rows in set. Elapsed: 0.010 sec.
删除projection
ALTER TABLE datasets.hits_v1 DROP PROJECTION p1;Projection本质是空间换时间,可以对比看到在有无Projection前后的数据大小
SELECT
table AS `表名`,
sum(rows) AS `总行数`,
formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`,
formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`,
round((sum(data_compressed_bytes) sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率`
FROM system.parts
WHERE table IN ('hits_v1')
GROUP BY table;
#前
┌─表名────┬───总行数─┬─原始大小──┬─压缩大小─┬─压缩率─┐
│ hits_v1 │ 17747796 │ 11.46 GiB │ 2.36 GiB │ 21 │
└─────────┴──────────┴───────────┴──────────┴────────┘
#后
┌─表名────┬───总行数─┬─原始大小──┬─压缩大小─┬─压缩率─┐
│ hits_v1 │ 17747796 │ 11.46 GiB │ 2.36 GiB │ 21 │
└─────────┴──────────┴───────────┴──────────┴────────┘
创建group by的Projection
ALTER TABLE datasets.hits_v1 ADD PROJECTION agg_p2(
SELECT
UserID,
SearchPhrase,
count()
GROUP BY UserID, SearchPhrase
);
alter table datasets.hits_v1 MATERIALIZE PROJECTION agg_p2;
在查看计划中看到使用了agg_p2的Projection
yejf :) explain select UserID,count(*) from datasets.hits_v1 group by UserID order by UserID;
┌─explain───────────────────────────────────────────────────────────────────────────┐
│ Expression (Projection) │
│ MergingSorted (Merge sorted streams for ORDER BY) │
│ MergeSorting (Merge sorted blocks for ORDER BY) │
│ PartialSorting (Sort each block for ORDER BY) │
│ Expression (Before ORDER BY) │
│ SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│ ReadFromStorage (MergeTree(with 1 projection agg_p2)) │
└───────────────────────────────────────────────────────────────────────────────────┘
投影功能的优化使用
需要优化的sql
SELECT
months,
sum(pay_money) AS total_money,
uniqCombined(order_id_number) AS total_order_count
FROM test.order_detail_local
WHERE years IN (2020) AND (province IN ('广东', '广西', '海南'))
GROUP BY months;
优化前的性能
没有缓存时9秒
有缓存时0.4秒
优化方案1:创建group by投影
ALTER TABLE test.order_detail_local DROP PROJECTION p1;
ALTER TABLE test.order_detail_local ADD PROJECTION p1(
SELECT years,months,province,sum(pay_money),uniqCombined(order_id_number)
group BY years,months,province
);
alter table test.order_detail_local MATERIALIZE PROJECTION p1;
测试发现有无缓存时间都差不多是2秒左右,但查询资源少
优化方案2:创建order by 投影
ALTER TABLE test.order_detail_local DROP PROJECTION p1;
ALTER TABLE test.order_detail_local
ADD PROJECTION p1
(
SELECT
years,
months,
province,
pay_money,
order_id_number
ORDER BY (years, months, province)
);
alter table test.order_detail_local MATERIALIZE PROJECTION p1;
投影未创建成功时会报错,可能是当前版本的bug
SELECT
months,
sum(pay_money) AS total_money,
uniqCombined(order_id_number) AS total_order_count
FROM test.order_detail_local
WHERE (years IN (2020)) AND (province IN ('广东', '广西', '海南'))
GROUP BY months
Query id: a46da1a3-16cd-4d63-83ab-a16a87bcc8c4
0 rows in set. Elapsed: 0.044 sec.
Received exception from server (version 21.9.7):
Code: 49. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Block structure mismatch in Pipe::unitePipes stream: different names of columns:
months UInt32 UInt32(size = 0)
order_id_number UInt64 UInt64(size = 0). (LOGICAL_ERROR)
有缓存时查询是0.5秒,没有缓存时查询是1秒多,但查询的资源比group by的投影多
优化方案3:创建物化视图
drop VIEW v_test1;
CREATE MATERIALIZED VIEW v_test1
ENGINE=MergeTree() ORDER BY (months) POPULATE
AS
SELECT years,months,
sum(pay_money) as total_money
,uniqCombined(order_id_number) as total_order_count
from test.order_detail_local
WHERE (years IN (2020)) AND (province IN ('广东', '广西', '海南'))
group by years,months;
查询性能很快,因为是一张新表了




