现在的aggr distinct支持
现在IMCI依靠Join + HashGroupby实现对aggr distinct的支持,在这里分类列出各个情况下的改写策略
不含grouping clause的情况
仅含一个agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1)
FROM t1;
会被改写为
SELECT
COUNT(tmp.col1)
FROM
(
SELECT
col1
FROM
t1
GROUP BY col1);
仅含有多个agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1), COUNT(DISTINCT col2)
FROM t1;
会被改写为
CREATE VIEW v1 AS
SELECT
COUNT(tmp.col1)
FROM
(
SELECT
col1
FROM
t1
GROUP BY col1);
CREATE VIEW v2 AS
SELECT
COUNT(tmp.col2)
FROM
(
SELECT
col2
FROM
t1
GROUP BY col2);
SELECT * FROM v1, v2;
同时含有普通agg和agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1), COUNT(DISTINCT col2), SUM(col3)
FROM t1;
会被改写为
CREATE VIEW v1 AS
SELECT
COUNT(tmp.col1)
FROM
(
SELECT
col1
FROM
t1
GROUP BY col1);
CREATE VIEW v2 AS
SELECT
COUNT(tmp.col2)
FROM
(
SELECT
col2
FROM
t1
GROUP BY col2);
CREATE VIEW v3 AS
SELECT SUM(col3) FROM t1;
SELECT * FROM v1, v2, v3;
含grouping clause的情况
仅含一个agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1)
FROM t1
GROUP BY col2;
会被改写为
SELECT
COUNT(tmp.col1)
FROM
(
SELECT
col1, col2
FROM
t1
GROUP BY col1, col2)
GROUP BY col2;
仅含有多个agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1), COUNT(DISTINCT col2)
FROM t1
GROUP BY col3;
会被改写为
CREATE VIEW v1 AS
SELECT
col3, COUNT(tmp.col1)
FROM
(
SELECT
col1, col3
FROM
t1
GROUP BY col1, col3
)
GROUP BY col3;
CREATE VIEW v2 AS
SELECT
col3, COUNT(tmp.col2)
FROM
(
SELECT
col2, col3
FROM
t1
GROUP BY col2, col3
)
GROUP BY col3;
SELECT * FROM v1 JOIN v2 ON v1.col3 = v2.col3;
同时含有普通agg和agg discinct
形如以下的SQL
SELECT
COUNT(DISTINCT col1), COUNT(DISTINCT col2), SUM(col3)
FROM t1
GROUP BY col4;
会被改写为
CREATE VIEW v1 AS
SELECT
col4, COUNT(tmp.col1)
FROM
(
SELECT
col1, col4
FROM
t1
GROUP BY col1, col4
)
GROUP BY col4;
CREATE VIEW v2 AS
SELECT
col4, COUNT(tmp.col2)
FROM
(
SELECT
col2, col4
FROM
t1
GROUP BY col2, col4
)
GROUP BY col4;
CREATE VIEW v3 AS
SELECT SUM(col3) FROM t1 GROUP BY col4;
SELECT * FROM v1, v2, v3 WHERE v1.col4 = v2.col4 and v2.col4 = v3.col4;
可以看出,大体的思路是:
- 对于单个
agg(distinct col), 使用叠加一个group by的方式使y = distinct col,然后计算agg(y) - 对于含有多个agg的情形,先拆分为多个group by,每个group by只含有一个agg,因此每个含有agg distinct的group by都可以依赖1来解决
但是,这样的作法在OSS外表的查询执行下有一些问题,因为这样的改写会生成大量的TableScan,在单机引擎中,我们认为TableScan算子是一个开销很小的算子,因为在单机上我们可以利用lru cache, pruner, prefetch等优化,同时读local disk和polar store相对较快,但是OSS上的TableScan和单机引擎有一些不同
- 以上的lru cache, pruner, prefetch等优化均不存在
- 目前的过滤条件使通过Scan + Filter两个算子实现,Filter没有下推到scan,scan输出的数据量很大
这直接导致了OSS上含有多个agg, distinct的query查询效果很差,因此,可能需要一个不含有join的处理agg distinct的方式
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




