暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

依靠Expand实现aggr distinct(1)

原创 手机用户2895 2024-10-29
168

现在的aggr distinct支持

现在IMCI依靠Join + HashGroupby实现对aggr distinct的支持,在这里分类列出各个情况下的改写策略

不含grouping clause的情况

仅含一个agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1) FROM t1;

会被改写为

SELECT COUNT(tmp.col1) FROM ( SELECT col1 FROM t1 GROUP BY col1);

仅含有多个agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2) FROM t1;

会被改写为

CREATE VIEW v1 AS SELECT COUNT(tmp.col1) FROM ( SELECT col1 FROM t1 GROUP BY col1); CREATE VIEW v2 AS SELECT COUNT(tmp.col2) FROM ( SELECT col2 FROM t1 GROUP BY col2); SELECT * FROM v1, v2;

同时含有普通agg和agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2), SUM(col3) FROM t1;

会被改写为

CREATE VIEW v1 AS SELECT COUNT(tmp.col1) FROM ( SELECT col1 FROM t1 GROUP BY col1); CREATE VIEW v2 AS SELECT COUNT(tmp.col2) FROM ( SELECT col2 FROM t1 GROUP BY col2); CREATE VIEW v3 AS SELECT SUM(col3) FROM t1; SELECT * FROM v1, v2, v3;

含grouping clause的情况

仅含一个agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1) FROM t1 GROUP BY col2;

会被改写为

SELECT COUNT(tmp.col1) FROM ( SELECT col1, col2 FROM t1 GROUP BY col1, col2) GROUP BY col2;

仅含有多个agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2) FROM t1 GROUP BY col3;

会被改写为

CREATE VIEW v1 AS SELECT col3, COUNT(tmp.col1) FROM ( SELECT col1, col3 FROM t1 GROUP BY col1, col3 ) GROUP BY col3; CREATE VIEW v2 AS SELECT col3, COUNT(tmp.col2) FROM ( SELECT col2, col3 FROM t1 GROUP BY col2, col3 ) GROUP BY col3; SELECT * FROM v1 JOIN v2 ON v1.col3 = v2.col3;

同时含有普通agg和agg discinct

形如以下的SQL

SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2), SUM(col3) FROM t1 GROUP BY col4;

会被改写为

CREATE VIEW v1 AS SELECT col4, COUNT(tmp.col1) FROM ( SELECT col1, col4 FROM t1 GROUP BY col1, col4 ) GROUP BY col4; CREATE VIEW v2 AS SELECT col4, COUNT(tmp.col2) FROM ( SELECT col2, col4 FROM t1 GROUP BY col2, col4 ) GROUP BY col4; CREATE VIEW v3 AS SELECT SUM(col3) FROM t1 GROUP BY col4; SELECT * FROM v1, v2, v3 WHERE v1.col4 = v2.col4 and v2.col4 = v3.col4;

可以看出,大体的思路是:

  1. 对于单个agg(distinct col), 使用叠加一个group by的方式使y = distinct col,然后计算agg(y)
  2. 对于含有多个agg的情形,先拆分为多个group by,每个group by只含有一个agg,因此每个含有agg distinct的group by都可以依赖1来解决

但是,这样的作法在OSS外表的查询执行下有一些问题,因为这样的改写会生成大量的TableScan,在单机引擎中,我们认为TableScan算子是一个开销很小的算子,因为在单机上我们可以利用lru cache, pruner, prefetch等优化,同时读local disk和polar store相对较快,但是OSS上的TableScan和单机引擎有一些不同

  • 以上的lru cache, pruner, prefetch等优化均不存在
  • 目前的过滤条件使通过Scan + Filter两个算子实现,Filter没有下推到scan,scan输出的数据量很大

这直接导致了OSS上含有多个agg, distinct的query查询效果很差,因此,可能需要一个不含有join的处理agg distinct的方式

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论