
本文字数:11882;估计阅读时间:30 分钟
作者:ClickHouse Team

Meetup活动
ClickHouse 杭州第二届 Meetup 开启报名,详见文末海报!

又到了每月一次的版本更新时间!
发布概要
ClickHouse 25.5 带来了 15 项新功能 🌸 23 项性能优化 🦋 以及 64 个缺陷修复 🐝
本次发布的亮点包括:向量相似度索引进入 Beta 阶段、Iceberg 表支持 Hive metastore catalog、支持查看每个函数何时被添加到 ClickHouse、本地工具 clickhouse-local 引入隐式表支持等诸多功能更新。

我们热烈欢迎所有在 25.5 版本中首次做出贡献的开发者!ClickHouse 社区正在持续壮大,正是因为大家的努力,ClickHouse 才能不断发展壮大。
以下是本次的新贡献者名单:
Andrian Iliev, Colerar, Dasha Wessely, Denis, KovalevDima, Kyamran, Marta Paes, Mojtaba Ghahari, Sachin Singh, YjyJeff, andrei tinikov, caicre, codeworse, gvoelfin, morsapaes, samay-sharma, shanfengp, tdufour

贡献者:scanhex12
ClickHouse 正在不断拓展对 Lakehouse 架构的支持。此前的版本已支持 Unity 和 AWS Glue 数据目录,本次新增了对 Hive metastore catalog 的支持。
用户可以通过 DataLakeCatalog 表引擎创建一个表来连接 Hive catalog:
SET allow_experimental_database_hms_catalog=true;CREATE DATABASE hiveENGINE = DataLakeCatalog('thrift://hive:9083', '', '')SETTINGScatalog_type = 'hive',warehouse = 'demo',storage_endpoint = 'http://minio:9000/warehouse';
随后,便可以像操作普通表一样查询该 catalog 中的表:
SELECT count()FROM default.`iceberg.hits`;

贡献者:Alexey Milovidov
在使用 clickhouse-local 工具处理标准输入文件时,现在可以省略 FROM 和 SELECT 子句。
举个例子,假设我们想要解析 GitHub 上 football.json 项目中的一个 JSON 文件,可以使用如下查询来探索文件中的属性结构:
curl -sS 'https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json' |./clickhouse -q "JSONAllPathsWithTypes(json)" --input-format JSONAsObject
{'matches':'Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))','name':'String'}
从查询结果中,我们可以看到两个属性:name 和 matches。接下来,我们可以进一步探索 matches 属性的内容:
curl -sS https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json |./clickhouse -q "arrayJoin(json.matches::Array(JSON)) LIMIT 5" --input-format JSONAsObject
{"date":"2024-08-16","round":"Matchday 1","score":{"ft":["1","0"],"ht":["0","0"]},"team1":"Manchester United FC","team2":"Fulham FC","time":"20:00"}{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["0","2"],"ht":["0","0"]},"team1":"Ipswich Town FC","team2":"Liverpool FC","time":"12:30"}{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["2","0"],"ht":["1","0"]},"team1":"Arsenal FC","team2":"Wolverhampton Wanderers FC","time":"15:00"}{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["0","3"],"ht":["0","1"]},"team1":"Everton FC","team2":"Brighton & Hove Albion FC","time":"15:00"}{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["1","0"],"ht":["1","0"]},"team1":"Newcastle United FC","team2":"Southampton FC","time":"15:00"}
如果我们希望对 matches 数组执行更深入的查询操作,就需要将查询语句改写为包含 FROM 子句的形式,如下所示:
curl -sS https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json |./clickhouse -q \"m.date, m.team1, m.team2, m.score.ft::Array(String)[1] || '-' || m.score.ft::Array(String)[2] AS scoreFROM tableARRAY JOIN json.matches::Array(JSON) AS mLIMIT 5" \--input-format JSONAsObject \--output-format Pretty \--output_format_pretty_row_numbers=0
此外,我们还支持通过指定输出格式,在查询结果中包含列名:
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓┃ m.date ┃ m.team1 ┃ m.team2 ┃ score ┃┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩│ 2024-08-16 │ Manchester United FC │ Fulham FC │ 1-0 │├────────────┼──────────────────────┼────────────────────────────┼───────┤│ 2024-08-17 │ Ipswich Town FC │ Liverpool FC │ 0-2 │├────────────┼──────────────────────┼────────────────────────────┼───────┤│ 2024-08-17 │ Arsenal FC │ Wolverhampton Wanderers FC │ 2-0 │├────────────┼──────────────────────┼────────────────────────────┼───────┤│ 2024-08-17 │ Everton FC │ Brighton & Hove Albion FC │ 0-3 │├────────────┼──────────────────────┼────────────────────────────┼───────┤│ 2024-08-17 │ Newcastle United FC │ Southampton FC │ 1-0 │└────────────┴──────────────────────┴────────────────────────────┴───────┘

贡献者:Robert Schulzesy
stem.functions 系统表新增了一个字段 introduced_in,用于记录每个函数首次被添加到 ClickHouse 的版本号。目前该字段已经覆盖了一部分新添加的函数,未来也会陆续补充旧函数的信息。
下面是 25.5 版本中新增的函数列表:
SELECT name, introduced_in, returned_valueFROM system.functionsWHERE introduced_in = '25.5'FORMAT PrettySpace;
name descriptionicebergBucket Implements logic of iceberg bucket transform: https://iceberg.apache.org/spec/#bucket-transform-details.icebergHash Implements logic of iceberg hashing transform: https://iceberg.apache.org/spec/#appendix-b-32-bit-hash-requirements.mapContainsValue Checks whether the map has the specified value.mapContainsValueLike Checks whether map contains value LIKE specified pattern.mapContainsKey Checks whether the map has the specified key.mapExtractValueLike Returns a map with elements which value matches the specified pattern.stringBytesEntropy Calculates Shannon's entropy of byte distribution in a string.stringBytesUniq Counts the number of distinct bytes in a string.

贡献者:scanhex12
ClickHouse 的 Parquet 读取器现已支持 Geo 类型。我们可以通过以下查询语句,对 Overture Maps 数据集中的几何字段进行解析:
SELECT id, geometry, bbox,COLUMNS('^id|geometry|bbox') APPLY(toTypeName)FROM s3('s3://overturemaps-us-west-2/release/2025-04-23.0/theme=places/type=place/*')WHERE categories.primary = 'pizza_restaurant'LIMIT 1SETTINGS schema_inference_make_columns_nullable = 0;
Row 1:──────id: 08f95008132998a6036bed7d0b7e7661geometry: (119.4407421,-5.15814)bbox: (119.44073,119.44074,-5.15814,-5.158139)toTypeName(id): StringtoTypeName(geometry): Tuple(Float64, Float64)toTypeName(bbox): Tuple(xmin Float32,xmax Float32,ymin Float32,ymax Float32)
现在,geometry 字段可以被正确识别并解析。而在旧版本中,该字段通常会以二进制的形式显示,如下所示:
Row 1:──────id: 08ff2a1b744f064403498c7aa7fce7a1geometry: (-149.069345,-83.5991503)bbox: (-149.06935,-149.06932,-83.59915,-83.599144)toTypeName(id): StringtoTypeName(geometry): PointtoTypeName(bbox): Tuple(xmin Float32,xmax Float32,ymin Float32,ymax Float32)
有了这个改进,ClickHouse 已可用于查询 GeoParquet 数据集。我们也一直在寻找有趣的地理数据集,欢迎大家推荐值得尝试的数据源!

贡献者:Shankar Iyer
ClickHouse 中的向量相似度索引功能现已从实验阶段升级为 Beta 阶段,并新增了对混合搜索中前置过滤和后置过滤策略的支持。
我们以下面这段示例,展示如何基于 Cohere 的 Wikipedia 数据子集使用向量索引。首先,我们启用相似度索引功能:
SET allow_experimental_vector_similarity_index=1;
接着,我们创建一个包含 emb 向量字段的表,并基于该字段建立 HNSW 索引:
CREATE TABLE wikiEmbeddings (id Int32,title String,text String,url String,wiki_id Int32,views Float32,paragraph_id Int32,langs Int32,emb Array(Float32),INDEX emb_hnsw emb TYPE vector_similarity('hnsw', 'L2Distance', 768))ORDER BY id;
然后从 Hugging Face 导入相应数据:
INSERT INTO wikiEmbeddingsselect *FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings/resolve/main/data/train-00000-of-00253-8d3dffb4e6ef0304.parquet')SETTINGS max_http_get_redirects=10;
现在就可以开始查询了。以下查询将返回数据集中最相似的一条记录:
SELECT * EXCEPT embFROM wikiEmbeddingsWHERE id = 120356;
Row 1:──────id: 120356title: Germany national football teamtext: The Germany national football team () represents Germany in men's international football and played its first match in 1908. The team is governed by the German Football Association ("Deutscher Fußball-Bund"), founded in 1900. Between 1949 and 1990, separate German national teams were recognised by FIFA due to Allied occupation and division: the DFB's team representing the Federal Republic of Germany (commonly referred to as West Germany in English between 1949 and 1990), the Saarland team representing the Saar Protectorate (1950–1956) and the East Germany team representing the German Democratic Republic (1952–1990). The latter two were absorbed along with their records; the present team represents the reunified Federal Republic. The official name and code "Germany FR (FRG)" was shortened to "Germany (GER)" following reunification in 1990.url: https://en.wikipedia.org/wiki?curid=250204wiki_id: 250204views: 3271.844paragraph_id: 0langs: 98
结果显示,这条记录来自 Wikipedia 上关于德国国家足球队的页面。我们可以进一步查询与该记录最相似的其他条目:
WITH (SELECT emb FROM wikiEmbeddings WHERE id = 120356) AS lookupSELECT id, title, text, url, L2Distance(emb, lookup) AS distFROM wikiEmbeddingsORDER BY distLIMIT 3FORMAT Vertical;
Row 1:──────id: 120356title: Germany national football teamtext: The Germany national football team () represents Germany in men's international football and played its first match in 1908. The team is governed by the German Football Association ("Deutscher Fußball-Bund"), founded in 1900. Between 1949 and 1990, separate German national teams were recognised by FIFA due to Allied occupation and division: the DFB's team representing the Federal Republic of Germany (commonly referred to as West Germany in English between 1949 and 1990), the Saarland team representing the Saar Protectorate (1950–1956) and the East Germany team representing the German Democratic Republic (1952–1990). The latter two were absorbed along with their records; the present team represents the reunified Federal Republic. The official name and code "Germany FR (FRG)" was shortened to "Germany (GER)" following reunification in 1990.url: https://en.wikipedia.org/wiki?curid=250204dist: 0Row 2:──────id: 120367title: Germany national football teamtext: The Federal Republic of Germany, which was referred to as West Germany, continued the DFB. With recognition by FIFA and UEFA, the DFB maintained and continued the record of the pre-war team. Switzerland was the first team that played West Germany in 1950, with the latter qualifying for the 1954 World Cup and the former hosting it.url: https://en.wikipedia.org/wiki?curid=250204dist: 5.537961Row 3:──────id: 120366title: Germany national football teamtext: After World War II, Germany was banned from competition in most sports until 1950. The DFB was not a full member of FIFA, and none of the three new German states – West Germany, East Germany, and Saarland – entered the 1950 World Cup qualifiers.url: https://en.wikipedia.org/wiki?curid=250204dist: 6.296622
不过,这些结果都来自同一篇 Wikipedia 文章的不同段落,因此参考意义有限。为了让结果更加丰富多样,我们可以添加 WHERE 子句,排除与查询记录具有相同 wiki_id 的条目:
WITH (SELECT emb FROM wikiEmbeddings WHERE id = 120356) AS lookup,(SELECT wiki_id FROM wikiEmbeddings WHERE id = 120356) AS lookup_wiki_idSELECT id, title, text, url, L2Distance(emb, lookup) AS distFROM wikiEmbeddingsWHERE wiki_id <> lookup_wiki_idORDER BY distLIMIT 3FORMAT Vertical;
Row 1:──────id: 6420title: Germanytext: Football is the most popular sport in Germany. With more than 7 million official members, the German Football Association ("Deutscher Fußball-Bund") is the largest single-sport organisation worldwide, and the German top league, the Bundesliga, attracts the second-highest average attendance of all professional sports leagues in the world. The German men's national football team won the FIFA World Cup in 1954, 1974, 1990, and 2014, the UEFA European Championship in 1972, 1980 and 1996, and the FIFA Confederations Cup in 2017.url: https://en.wikipedia.org/wiki?curid=11867dist: 6.9714794Row 2:──────id: 6336title: Germanytext: Formal unification of Germany into the modern nation-state was commenced on 18 August 1866 with the North German Confederation Treaty establishing the Prussia-led North German Confederation later transformed in 1871 into the German Empire. After World War I and the German Revolution of 1918–1919, the Empire was in turn transformed into the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a totalitarian dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany as a whole was organized into two separate polities with limited sovereignity: the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany, while Berlin continued its Four Power status. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic.url: https://en.wikipedia.org/wiki?curid=11867dist: 6.98255Row 3:──────id: 7767title: FIFA World Cuptext: After FIFA was founded in 1904, it tried to arrange an international football tournament between nations outside the Olympic framework in Switzerland in 1906. These were very early days for international football, and the official history of FIFA describes the competition as having been a failure.url: https://en.wikipedia.org/wiki?curid=11370dist: 7.5603456dist: 7.736428
正如前文所述,WHERE 子句中的过滤条件可以采用后置或前置方式执行。
后置过滤指的是先使用向量索引检索候选数据,再对其执行过滤条件;而前置过滤则是先筛选符合条件的记录,再进行向量相似度匹配。
你可以通过设置系统参数 vector_search_filter_strategy 来控制过滤方式。该参数默认值为 auto,表示由查询引擎根据启发式规则自动决定最优策略。你也可以将其显式设置为 postfilter 或 prefilter,以指定过滤顺序。

好消息:ClickHouse Hangzhou User Group第 2 届 Meetup 火热报名中,将于2025年07月05日在杭州西溪万怡酒店(杭州市西湖区蒋村街道文二西路770号.中国五村园8号楼)举行,扫码免费报名


/END/
试用阿里云 ClickHouse企业版
轻松节省30%云资源成本?阿里云数据库ClickHouse 云原生架构全新升级,首次购买ClickHouse企业版计算和存储资源组合,首月消费不超过99.58元(包含最大16CCU+450G OSS用量)了解详情:https://t.aliyun.com/Kz5Z0q9G


征稿启示
面向社区长期正文,文章内容包括但不限于关于 ClickHouse 的技术研究、项目实践和创新做法等。建议行文风格干货输出&图文并茂。质量合格的文章将会发布在本公众号,优秀者也有机会推荐到 ClickHouse 官网。请将文章稿件的 WORD 版本发邮件至:Tracy.Wang@clickhouse.com






