Clickhouse索引介绍

手机用户2895 2023-09-25

4012

1. 概述

ClickHouse索引主要是用于加快查询速度，包括主键索引和跳数索引（skipping index）等，其中跳数索引包含minmax、set、bloom filter、token bloom filter、ngram bloom filter、inverted、hypothesis、annoy 与 usearch 等索引。

本文将先简单介绍主键索引，然后描述各类跳数索引原理和CK实现，着重讲解倒排索引，同时利用立创数据进行实验各种bloom filter、inverted 与 fts 等，最后分析跳数索引在IMCI中可行性。

2. 主键索引

在传统关系数据库，每个表行包含一个主索引，由于索引允许快速定位特定的行，因此可以提高点查和更新的效率。但这种能力是有代价的，比如额外的磁盘和内存开销以及更高的插入成本。

ClickHouse 主要追求高并发插入和较大规模数据，因此 ClickHouse 不是为每一行创建索引，而是为一组数据块（granule，类似于imci row group）构建一个索引条目。与基于btree的索引直接定位到单行（如基于btree的索引)不同，ClickHouse 稀疏主键索引则先通过对索引项进行二分查找来识别出可能匹配查询的数据块，然后以并行方式加载与扫描数据块来找出匹配的行。

按列以固定行数来建构数据块，其按照主键列(以及排序键其它列)的字典序存储在磁盘上：
undefined

每一数据块第一行称为mark行：
undefined

主键索引存储该mark行对应的主键值：
undefined

如果没有使用 PRIMARY KEY 显式指定主键，ClickHouse 会使用排序键（由 ORDER BY 子句指定）作为主键，当然如果不需要排序也可以使用 ORDER BY tuple()。默认情况下主键跟排序键相同，若同时指定了主键和排序键则主键必须是排序键的前缀，当然一些特殊的表引擎也支持主键与排序键不同。此外，ClickHouse 也提供分区能力，其分区键（PARTITION BY）和主键(PRIMARY KEY)是两种不同的加速数据查询的方式，应用时应当尽量错开使用不同列来定义分区键和主键以便覆盖更多的查询场景。

更详细描述参见 https://clickhouse.com/docs/en/optimize/sparse-primary-indexes

3. 跳数索引

1. 跳数索引

在大多数查询场景中，ClickHouse 通过主键索引进行二分查找并直接定位到对应数据块，避免全表扫描从而有效提升查询性能。但并非每一 where 条件均有主键，如此则需要对该列值进行完整扫描。

为了减少全表扫描，ClickHouse 还提供一种不同类型的索引：跳数索引(skipping index)。不同于提供快速定位能力的主键索引，跳数索引主要用于查询时跳过保证没有匹配的数据块。当需要加速查询时，主键通常是首先考虑的手段。然而表只限于一个主键，对于多种场景无法有效利用主键的查询，另外也存在一些使用主键列但无法利用主键的情况，如like “%xxx%” 等。在这些情况下，ClickHouse 可能会被迫对每个列执行全表扫描，此时则可以考虑跳数索引加速这些查询，不可避免全表扫描时尽量减少IO和匹配计算等。当然跳数索引并非总是正收益，尤其列与表达式间存在弱相关的，这时跳数索引将匹配绝大多数块，无法起到过滤作用，但存在索引开销，此时可以考虑修改排序键或分区键、采用投影或物化视图等，其中投影是比跳数索引更像的“索引”。

ClickHouse 提供 minmax、set、bloom filter、token bloom filter、ngram bloom filter、inverted、hypothesis、annoy 与 usearch 等跳数索引类型，下面各小节分别介绍其作用、用法与实现等。

创建跳表索引可以通过 create table 时指定
INDEX index_name expr TYPE type_name(…) GRANULARITY granularity_value
或者创建表后再 add index 添加
alter table table_name ADD INDEX index_name expr TYPE type_name(…) GRANULARITY granularity_value;
其主要有四个参数：

索引名称：索引名用于在每个分区中创建索引文件；在删除或具体化索引时需要将其作为参数；
索引的表达式：索引表达式用于计算存储在索引中的值集，可以是列、简单操作符、函数的子集的组合；
类型：索引的类型，可为 minmax、set、bloom_filter、tokenbf_v1、ngrambf_v1、inverted、hypothesis、annoy 与 usearch 等；
GRANULARITY：指定索引块为表粒度（建表时指定的index_granularity）倍数，如主表索引粒度为 8K 行，GRANULARITY 为4，则每个索引块大小为 32K 行；

更详细描述参见
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-data_skipping-indexes
https://clickhouse.com/docs/en/optimize/skipping-indexes

2. minmax 索引

minmax 索引需要存储每个块的最小值和最大值，存储代价比较小；查询时这种索引类型的开销通常是最小的；在等值和范围查询中能够帮助快速跳过不满足要求的块，减少IO。

创建 minmax 索引

create table skip_minmax_table (id UInt64, data UInt64) ENGINE MergeTree primary key id SETTINGS index_granularity=64;
insert into skip_minmax_table select rand(), number from numbers(100000000);

alter table skip_minmax_table add index skip_minmax_index data type minmax;

新创建的跳数索引只应用于新插入的数据；若让其对已有数据生效，则需要物化，但此为异步构建，返回后并不代表索引可使用；其它跳数索引也类似。

alter table skip_minmax_table materialize index skip_minmax_index;

创建后可以通过 show 或 select 来查看

show create table skip_minmax_table;
select * from system.data_skipping_indices;

可以通过 explain indexs = 1 来查看索引使用情况

explain select * from skip_minmax_table where data = 10;

┌─explain─────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ ReadFromMergeTree (default.skip_minmax_table) │
└─────────────────────────────────────────────────┘

explain indexes = 1 select * from skip_minmax_table where data = 10;

┌─explain─────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ ReadFromMergeTree (default.skip_minmax_table) │
│ Indexes: │
│ PrimaryKey │
│ Condition: true │
│ Parts: 6/6 │
│ Granules: 1562504/1562504 │
│ Skip │
│ Name: skip_minmax_index │
│ Description: minmax GRANULARITY 1 │
│ Parts: 2/6 │
│ Granules: 1179506/1562504 │
└─────────────────────────────────────────────────┘

可以 set send_logs_level=‘trace’; 来执行来输出索引过滤性

select * from skip_minmax_table where data = 10;

default.skip_minmax_table (475557b6-89b7-42be-9455-f5737589bedf) (SelectExecutor): Index `skip_minmax_index` has dropped 1562493/1562504 granules.

3. set 索引

set 索引主要存储指定表达式的 distinct 值，创建时需要指定每块值集合不重复值数目，若为0则表示存储无限数量的离散值；若一个索引块中唯一值超过指定配置大小则该数据块 set 索引将不生效。

set 索引适用于每块基数较小（即块区分度低）但总体基数较高的列（即总体区分度较高），就是说相同数据尽量可能聚集在一块，如按 order key 构建时；也适用高基数表达式，比如错误码或支付状态等，利用 set 索引过滤掉绝大多数不包含错误的数据块来提升性能。

该索引的成本、性能和有效性取决于块中的基数，如当每个块包含大量惟一值，则存储需要大量空间，查询也将比较耗时。鉴于 set 索引有其高效场景，imci 感觉也可以支持。

create table skip_set_table (id UInt64, data String) ENGINE MergeTree primary key id SETTINGS index_granularity=4096;
insert into skip_set_table select rand(), toString((number % 3) + 3) from numbers(100000000);

alter table skip_set_table add index skip_set_index data type set(10);

4. bloom_filter 索引

ClickHouse 提供三种基于 bloom filter 跳数索引类型：bloom_filter、tokenbf_v1 与 ngrambf_v1。

bloom_filter([false_positive])索引对指定列构建 bloom filter，可指定假阳性率参数，取值范围是 (0,1)，默认为0.025，其用于加速等值、like、in等查询，但不适用于范围类查询。

bloom filter 允许对集合成员进行高效的是否存在测试，但代价是有轻微的误报。对于跳数索引来说，假阳性不影响正确性，仅是多读取一些不必要的块而已。

bloom filter 适用于数据块中较多离散值的情况和大量条件表达式判断的场景等。然而 bloom filter 可能会包含不符合条件的匹配，因此 bloom filter 索引不能用于结果返回为假的函数。比如可以优化 c = 0、not c != 0 与 c like '%xxx%" 等，但不能用于 not c = 0、c != 0 与 not like “%xxx%” 等。

create table skip_bf_table (id UInt64, data String) ENGINE MergeTree primary key id;
drop table skip_bf_table;
insert into skip_bf_table select rand(), toString(rand()%10000) from numbers(100000000);

alter table skip_bf_table add index skip_bf_index(data) TYPE bloom_filter(0.1) GRANULARITY 1;
alter table skip_bf_table materialize index skip_bf_index;
optimize table skip_bf_table final;

查询时 bloom_filter 索引过滤性

select count() from skip_bf_table where data = '3';

1 row in set. Elapsed: 0.097 sec. Processed 100.00 million rows, 1.29 GB (1.03 billion rows/s., 13.23 GB/s.)
Peak memory usage: 11.92 MiB.

1 row in set. Elapsed: 0.107 sec. Processed 60.76 million rows, 783.12 MB (566.57 million rows/s., 7.30 GB/s.)
Peak memory usage: 11.81 MiB.

5. tokenbf_v1 索引

tokenbf_v1 索引类似于 bloom_filter 索引，不同在于 tokenbf_v1 先对指定列按 token 分词后再构建 bloom filter，常用于 like 等查询。基于 token 分词是指按空格、标点符号或其它非字母数字等分割符进行分割出词语集合，如 “I am ‘imci’” 则分割出 “I”、“am” 与 “imci”。

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed) 包含三个参数：

size_of_bloom_filter_in_bytes：bloom filter 大小，字节为单位；bloom filter 越大会有更少的假阳性但有更高的存储成本；
number_of_hash_functions：bloom filter 使用的散列函数数目，更多散列函数可能会减少假阳性，但有增加计算量；
random_seed：bloom filter 散列函数的随机种子；

其用法类似 bloom_filter 索引。

6. ngrambf_v1 索引

ngrambf_v1 索引类似于 tokenbf_v1 索引，不同在于 ngrambf_v1 采用 ngram 分词方式，常用于 like 等查询。基于 ngram 分词主要是按指定字符数目进行分割，如 “I am ‘imci’” 则 ngram(2) 分割出 “I “、” a”、“am”、“m “、” '”、"‘i"、“im”、“mc” 与 “ci”，常用于没有空格分隔的语言，如中文。

ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed) 包含四个参数，其中 n 为连词长度，即要索引的ngram的大小；其它参数同 tokenbf_v1。

其用法类似 bloom_filter 索引。

7. inverted 索引

inverted 索引详细参见 ClickHouse 倒排索引

inverted(N) 用于创建倒排索引，其中

inverted(0) 或 inverted() 采用 token 分词方式；
inverted(N) 2 <= N <= 8，采用 ngram(N) 分词方式；

需要先开启，其只影响创建与删除等但不影响查询，而 use_skip_indexes 影响查询，默认开启；

set allow_experimental_inverted_index = true;

创建 inverted 索引

create table skip_inverted_table (id UInt64, val String) ENGINE MergeTree primary key id;
insert into skip_inverted_table select rand(), toString(rand()%1000000000 + 1000000000) from numbers(100000000);
alter table skip_inverted_table add index skip_inverted_index val TYPE inverted(5) GRANULARITY 1;
alter table skip_inverted_table materialize index skip_inverted_index;
optimize table skip_inverted_table final;

inverted 大概有1到8倍提升，like越长则提升越明显；若 like 中长度小于指定 ngrams 参数则索引不生效；

select count() from skip_inverted_table where val like '%6274156%';
1 row in set. Elapsed: 0.014 sec. Processed 221.18 thousand rows, 4.20 MB (15.28 million rows/s., 290.40 MB/s.)

select count() from skip_inverted_table where val like '%627415%';
1 row in set. Elapsed: 0.024 sec. Processed 2.50 million rows, 47.47 MB (104.93 million rows/s., 1.99 GB/s.)

select count() from skip_inverted_table where val like '%62741%';
1 row in set. Elapsed: 0.042 sec. Processed 20.86 million rows, 396.28 MB (493.03 million rows/s., 9.37 GB/s.)

select count() from skip_inverted_table where val like '%6271%';
1 row in set. Elapsed: 0.093 sec. Processed 100.00 million rows, 1.90 GB (1.08 billion rows/s., 20.54 GB/s.)

也支持中文

create table skip_inverted_table2 (id UInt64, val String, INDEX skip_inverted_index2(val) TYPE inverted(2)) ENGINE MergeTree ORDER BY id SETTINGS index_granularity = 2;
insert into skip_inverted_table2 values (1, 'imci好'), (2, '你好imci'), (3, 'polardb你'), (4, 'imc你i好'), (5, 'imci好你'), (6, 'i好m好c你i你');

select * from skip_inverted_table2 where val like '%你好%';

default.skip_inverted_table2 (20eb3c14-1121-463d-a5cd-e8ad8051fc55) (SelectExecutor): Index `skip_inverted_index2` has dropped 2/3 granules.
┌─id─┬─val──────┐
│  2 │ 你好imci │
└────┴──────────┘
1 row in set. Elapsed: 0.002 sec.

8. hypothesis索引

hypothesis索引创建时指定 pred 表达式，CK 会计算每一行表达式值得到 true / false 集合，在查询时根据查询条件与索引表达式进行对比来判断索引是否生效；

create table skip_hypothesis_table (id UInt64, data UInt64) ENGINE MergeTree order by ();
insert into skip_hypothesis_table select rand(), number from numbers(10000000);

https://clickhouse.com/blog/whats-new-in-clickhouse-21-12
The expression is checked and the result (true/false) is written as an index for query optimization.

alter table skip_hypothesis_table add index skip_hypothesis_index (data > 1000) TYPE hypothesis GRANULARITY 1;
alter table skip_hypothesis_table materialize index skip_hypothesis_index;
optimize table skip_hypothesis_table final;

SELECT count() FROM skip_hypothesis_table WHERE data <= 900;

1 row in set. Elapsed: 0.004 sec. Processed 8.19 thousand rows, 65.54 KB (2.17 million rows/s., 17.38 MB/s.)
Peak memory usage: 444.46 KiB.

SELECT count() FROM skip_hypothesis_table WHERE data <= 900 SETTINGS use_skip_indexes = 0;

1 row in set. Elapsed: 0.012 sec. Processed 10.00 million rows, 80.00 MB (818.81 million rows/s., 6.55 GB/s.)
Peak memory usage: 11.79 MiB.

9. annoy 索引

最近邻域搜索(Nearest Neighbor Search)是指在 N 维向量空间中找到与给定点距离最小的点的问题，如 Annoy 搜索算法。
对于多维向量数据，若进行全表扫描且逐条对比，其性能和内存更为挑战；因此，ClickHouse提供 ANN (Approximate Nearest Neighbour) 索引来过滤无效数据，其采用 Annoy (https://github.com/spotify/annoy) 与 USearch(https://github.com/unum-cloud/usearch) 库来实现；

annoy 索引由使用 Annoy 库来实现；
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes
https://github.com/ClickHouse/ClickHouse/commit/e7abc06c89002d28ebfc10beac1e164a335f47e6

set allow_experimental_annoy_index = 1;

create table skip_annoy_table (id UInt64, datas Array(Float32)) ENGINE MergeTree order by id settings index_granularity = 2;
insert into skip_annoy_table values(0, [0.0,0.0,1.1]), (1, [0.0,0.0,0.9]), (3, [0.0,0.0,10.0]), (4, [0.0,0.0,50.1]), (5, [0.0,0.0,100.0]);

alter table skip_annoy_table drop index skip_annoy_index;
alter table skip_annoy_table add index skip_annoy_index datas TYPE annoy GRANULARITY 1;
alter table skip_annoy_table add index skip_annoy_index datas TYPE annoy('L2Distance', 10) GRANULARITY 1;
optimize table skip_annoy_table final;

SELECT * FROM skip_annoy_table WHERE L2Distance(datas, [0.0,0.0,1.0]) <= 1.0 limit 1;

10. usearch 索引

usearch 索引由使用 USearch 库来实现；
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes
https://github.com/ClickHouse/ClickHouse/commit/48c62fd75e2977ee9b23ceea070782188536c2ba

set allow_experimental_usearch_index = 1;

create table skip_usearch_table (id UInt64, datas Array(Float32)) ENGINE MergeTree order by id settings index_granularity = 5;

insert into skip_usearch_table values(1, [0.0, 0.0, 10.0]), (2, [0.0, 0.0, 10.5]), (3, [0.0, 0.0, 9.5]), (4, [0.0, 0.0, 9.7]), (5, [0.0, 0.0, 10.2]), (6, [10.0, 0.0, 0.0]), (7, [9.5, 0.0, 0.0]), (8, [9.7, 0.0, 0.0]), (9, [10.2, 0.0, 0.0]), (10, [10.5, 0.0, 0.0]), (11, [0.0, 10.0, 0.0]), (12, [0.0, 9.5, 0.0]), (13, [0.0, 9.7, 0.0]), (14, [0.0, 10.2, 0.0]), (15, [0.0, 10.5, 0.0]);

alter table skip_usearch_table add index skip_usearch_index datas TYPE usearch() GRANULARITY 1;
optimize table skip_usearch_table final;

SELECT * FROM skip_usearch_table WHERE L2Distance(datas, [0.0,0.0,10.0]) <= 1.0 limit 3;

11. 函数支持

各类索引对函数的支持列表

undefined

对于 like 操作，除外 bloom_filter 索引外均支持；

4.实验

在立创数据上测试 CK ngrambf_v1与inverted等跳数索引的过滤效果.

https://aliyuque.antfin.com/polarm/oncxfu/cgg3igmppvap3vef

5.总结

立创临时方案：

= 或 >= 或 <= 带 COLLATE 时 pruner 存在正确性bug, done；
imci用minmax pruner来支持"like"与"like%"过滤, done；
imci添加ngrambf pruner来支持"like"、"%like"、“like%”、"%like%"过滤, doing；
imci用reverse minmax pruner来支持"%like"过滤, 代价也是非常小的, 如果不支持则也可以利用第4点；
用fts来做"%like%“过滤,但还需要保留”%like%"以保证正确性, 即 against(‘xx’) and like ‘%xxx%’；
pred的child次序可以根据pruner是否生效和有效性进行调整；
在没有索引情况下，"%like%" ck(0.7s)也比imci快(2.2s)；

imci产品功能：

经立创数据验证, inverted 与 ngrambf_v1 过滤性都还可以，inverted 更佳；
阅读 like/inverted/ngrambf_v1 等代码与测试来验证数据构建时间和内存磁盘量等，合适就直接port过来；倒排索引可以按pack粒度在后台慢构建，不影响插入/更新/dll等；等到构建完成后才开始使用；
CK 优化点：four string hash table & jit(1.5-3.0 times)

TODO：
排序后测试 “like” 与 “%like” 有效性；
当删除率比较高时 bloomfilter 可能会比较低效，同时测试 imci ngram bloom filter 有效性；

polardb

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者