点击上方【蓝色】字体 关注我们

01 场景描述
日期 内容 学习用户2022-01-01 java u12022-01-02 java u12022-01-02 java u22022-01-01 C u12022-01-01 C u32022-01-01 Python u42022-01-02 Python u42022-01-02 Python u52022-01-02 Python u6
日期 内容 去重截止学习用户数2022-01-02 java 22022-01-01 C 22022-01-02 Python 3
截止到2022-01-01,学习内容java的为去重用户数为1,学习内容C的为去重用户数为2,学习内容Python的为去重用户数为1。所以2022-01-01学习内容为java和python的都要内过滤。
02 数据准备
create table study as (-- 基础数据select '2022-01-01' as pdate, 'java' as icate, 'u1' as user_idunion allselect '2022-01-02' as pdate, 'java' as icate, 'u1' as user_idunion allselect '2022-01-02' as pdate, 'java' as icate, 'u2' as user_idunion allselect '2022-01-01' as pdate, 'C' as icate, 'u1' as user_idunion allselect '2022-01-01' as pdate, 'C' as icate, 'u3' as user_idunion allselect '2022-01-01' as pdate, 'Python' as icate, 'u4' as user_idunion allselect '2022-01-02' as pdate, 'Python' as icate, 'u4' as user_idunion allselect '2022-01-02' as pdate, 'Python' as icate, 'u5' as user_idunion allselect '2022-01-02' as pdate, 'Python' as icate, 'u6' as user_id);

03 问题分析
由于要按照内容及学习用户去重,且实现的是按照时间累计去重,因此基本的思路就是按照内容及用户分组后在最早出现时间点出标记为1,其余标记为0,以此来实现累加计算。
select pdate,icate,user_id,case when row_number() over(partition by icate,user_id order by pdate) = 1 then 1 else 0 end statusfrom study

第二步:按照日期和内容分组,对步骤1的状态进行汇总,并按照日期累计当前汇总后的状态值。具体SQL如下:
select pdate, icate, sum(sum(status)) over (partition by icate order by pdate) acc_user_cntfrom (select pdate, icate, user_id, case when row_number() over (partition by icate,user_id order by pdate) = 1 then 1 else 0 end statusfrom study) tgroup by pdate, icate

第三步:过滤掉累计值小于1的数据。最终SQL如下:
select pdate, icate, acc_user_cntfrom (select pdate, icate, sum(sum(status)) over (partition by icate order by pdate) acc_user_cntfrom (select pdate, icate, user_id, case when row_number() over (partition by icate,user_id order by pdate) = 1 then 1 else 0 end statusfrom study) tgroup by pdate, icate) twhere acc_user_cnt > 1

完整的SQL如下:
with tmp1 as (select distinct pdate, icatefrom study)select a.pdate,a.icate,count(distinct a.user_id) as icountfrom study ajointmp1 bona.pdate <= b.pdateanda.icate = b.icategroup by a.pdate, a.icatehaving icount > 1;

04 小 结
本文给出了计算指标按照日期字段进行累计去重的优雅解法,文中采用了2种解法,一种借助于分析函数作为辅助变量生成状态标记求解,一种利用自关联生成全量数据进行行行比较求解,两种方式都非常巧妙。
与本文相关的问题:
往期精彩:
SQL进阶技巧:最近有效的缺失值填充问题【last_value实现版】
会飞的一十六
扫描右侧二维码关注我们
点个【在看】 你最好看

文章转载自会飞的一十六,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。






