Ant Forest Background and Data Preparation: Hands-on Ant Forest Plant Redemption Analysis with HiveQL
table_name: user_low_carbon

Ant Forest user low-carbon flow table, recording how much carbon emission each user reduces per day.

| user_id | data_dt | low_carbon |
| ------- | ------- | ---------- |
| User ID | Date    | Carbon reduction (g) |
Ant Forest plant redemption table, recording the amount of carbon reduction required to claim each environmental plant.

table_name: plant_carbon

| plant_id | plant_name | low_carbon |
| -------- | ---------- | ---------- |
| Plant ID | Plant name | Carbon required to redeem the plant |
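For completeness, a minimal data-preparation sketch follows. The column types come from the schemas above, but the tab delimiter and the local file paths are assumptions for illustration, not part of the original exercise:

```sql
-- Hypothetical DDL and loads; the '\t' delimiter and /opt/data/ paths are assumed.
create table user_low_carbon(user_id String, data_dt String, low_carbon int)
row format delimited fields terminated by '\t';

create table plant_carbon(plant_id String, plant_name String, low_carbon int)
row format delimited fields terminated by '\t';

load data local inpath '/opt/data/user_low_carbon.txt' into table user_low_carbon;
load data local inpath '/opt/data/plant_carbon.txt' into table plant_carbon;
```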
Problem: query the daily flow records in user_low_carbon for users who, in 2020, reduced their carbon emissions (low_carbon) by more than 100g per day on three or more consecutive days. The query should return the matching raw flow records from user_low_carbon.
For example, user u_002 qualifies with the records below, because over the four consecutive days 2020/1/2 to 2020/1/5 the daily carbon-reduction total exceeds 100g every day (a quick check query follows the sample rows):

Jan 2: 150 + 70 = 220
Jan 3: 30 + 80 = 110
Jan 4: 150
Jan 5: 101
| seq (key) | user_id | data_dt  | low_carbon |
| --------- | ------- | -------- | ---------- |
| xxxxx10   | u_002   | 2020/1/2 | 150        |
| xxxxx11   | u_002   | 2020/1/2 | 70         |
| xxxxx12   | u_002   | 2020/1/3 | 30         |
| xxxxx13   | u_002   | 2020/1/3 | 80         |
| xxxxx14   | u_002   | 2020/1/4 | 150        |
| xxxxx15   | u_002   | 2020/1/5 | 101        |
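To double-check the arithmetic above, the per-day sums can be reproduced directly (assuming the sample data is loaded):

```sql
-- Reproduce u_002's daily totals shown above: 220, 110, 150, 101.
select data_dt, sum(low_carbon) as day_sum
from user_low_carbon
where user_id = 'u_002'
  and data_dt in ('2020/1/2', '2020/1/3', '2020/1/4', '2020/1/5')
group by data_dt;
```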
Solution 1:

SQL workflow (the intermediate tables are created only so the intermediate data can be inspected; temporary tables work just as well, since they are dropped automatically when the session closes; see the variant after step 1):

1. Aggregate by user ID and date to compute each user's daily carbon reduction (2020 only), keeping days above 100g
```sql
create table t1(user_id String, data_dt String, low_carbon_sum int);

insert overwrite table t1
select
    user_id,
    data_dt,
    sum(low_carbon) as low_carbon_sum
from user_low_carbon
where substring(data_dt, 1, 4) = '2020'
group by user_id, data_dt
having low_carbon_sum > 100; -- t1
```
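The same step with a temporary table, as a sketch assuming Hive 0.14 or later (where CREATE TEMPORARY TABLE ... AS SELECT is available):

```sql
-- Temporary variant of t1: dropped automatically when the session closes.
create temporary table t1 as
select user_id, data_dt, sum(low_carbon) as low_carbon_sum
from user_low_carbon
where substring(data_dt, 1, 4) = '2020'
group by user_id, data_dt
having low_carbon_sum > 100;
```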
2. For each row, pull the dates of the two rows before and the two rows after it (within the same user, ordered by date) onto one line, using 1970/7/1 as the default where a neighbor does not exist
```sql
create table t2(user_id String, data_dt String, lag2_date String, lag1_date String,
                lead1_date String, lead2_date String);

insert overwrite table t2
select
    user_id,
    data_dt,
    -- order by data_dt is required: lag/lead over an unordered partition
    -- would pick neighbors in arbitrary order
    lag(data_dt, 2, "1970/7/1")  over (partition by user_id order by data_dt) as lag2_date,
    lag(data_dt, 1, "1970/7/1")  over (partition by user_id order by data_dt) as lag1_date,
    lead(data_dt, 1, "1970/7/1") over (partition by user_id order by data_dt) as lead1_date,
    lead(data_dt, 2, "1970/7/1") over (partition by user_id order by data_dt) as lead2_date
from t1; -- t2
```
```
hive> select * from t2;
OK
u_001  2020/1/2  1970/7/1  1970/7/1  2020/1/6  1970/7/1
u_001  2020/1/6  1970/7/1  2020/1/2  1970/7/1  1970/7/1
u_002  2020/1/2  1970/7/1  1970/7/1  2020/1/3  2020/1/4
u_002  2020/1/3  1970/7/1  2020/1/2  2020/1/4  2020/1/5
u_002  2020/1/4  2020/1/2  2020/1/3  2020/1/5  1970/7/1
u_002  2020/1/5  2020/1/3  2020/1/4  1970/7/1  1970/7/1
...
```
3. Compute the difference in days between each row's date and those of the rows two positions before and after it

```sql
create table t3(user_id String, data_dt String, lag2 int, lag1 int, lead1 int, lead2 int);

insert overwrite table t3
select
    user_id,
    data_dt,
    -- datediff only understands yyyy-MM-dd, so convert the slashes first
    datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag2_date, "/", "-"))  as lag2,
    datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag1_date, "/", "-"))  as lag1,
    datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead1_date, "/", "-")) as lead1,
    datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead2_date, "/", "-")) as lead2
from t2; -- t3
```
```
hive> select * from t3;
OK
u_001  2020/1/2  18082  18082  -4     18082
u_001  2020/1/6  18086  4      18086  18086
u_002  2020/1/2  18082  18082  -1     -2
u_002  2020/1/3  18083  1      -1     -2
u_002  2020/1/4  2      1      -1     18084
u_002  2020/1/5  2      1      18085  18085
...
```

(The large values such as 18082 are differences against the 1970/7/1 default.)
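As a quick sanity check on the conversion (illustrative one-off query):

```sql
-- Adjacent dates differ by exactly 1 day after the slash-to-dash conversion.
select datediff(regexp_replace('2020/1/5', '/', '-'),
                regexp_replace('2020/1/4', '/', '-'));  -- 1
```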
4. Extract the (user_id, data_dt) pairs that belong to a run of at least 3 consecutive days (runs of 4 or more days naturally contain them)

A qualifying row is the last, the middle, or the first of three consecutive days:

```sql
create table t4(user_id String, data_dt String);

insert overwrite table t4
select
    user_id,
    data_dt
from t3
where (lag2 = 2 and lag1 = 1)       -- last day of three consecutive days
   or (lag1 = 1 and lead1 = -1)     -- middle day
   or (lead1 = -1 and lead2 = -2);  -- first day
-- t4
```
```
hive> select * from t4;
OK
...
u_002  2020/1/2
u_002  2020/1/3
u_002  2020/1/4
u_002  2020/1/5
u_005  2020/1/2
u_005  2020/1/3
u_005  2020/1/4
u_008  2020/1/4
u_008  2020/1/5
u_008  2020/1/6
u_008  2020/1/7
u_009  2020/1/2
u_009  2020/1/3
u_009  2020/1/4
u_010  2020/1/4
u_010  2020/1/5
u_010  2020/1/6
u_010  2020/1/7
u_011  2020/1/1
u_011  2020/1/2
u_011  2020/1/3
u_013  2020/1/2
u_013  2020/1/3
u_013  2020/1/4
u_013  2020/1/5
u_014  2020/1/5
u_014  2020/1/6
u_014  2020/1/7
...
```
5. Join back to the original table to get the final records

```sql
select
    t5.user_id,
    t5.data_dt,
    t5.low_carbon
from user_low_carbon t5
join t4
  on t4.user_id = t5.user_id
 and t4.data_dt = t5.data_dt;
```
6. Final SQL

```sql
SELECT
    t5.user_id,
    t5.data_dt,
    t5.low_carbon
FROM user_low_carbon t5
JOIN (
    SELECT
        user_id,
        data_dt
    FROM (
        SELECT
            user_id,
            data_dt,
            datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag2_date, "/", "-"))  lag2,
            datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag1_date, "/", "-"))  lag1,
            datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead1_date, "/", "-")) lead1,
            datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead2_date, "/", "-")) lead2
        FROM (
            SELECT
                user_id,
                data_dt,
                lag(data_dt, 2, "1970/7/1")  OVER (PARTITION BY user_id ORDER BY data_dt) AS lag2_date,
                lag(data_dt, 1, "1970/7/1")  OVER (PARTITION BY user_id ORDER BY data_dt) AS lag1_date,
                lead(data_dt, 1, "1970/7/1") OVER (PARTITION BY user_id ORDER BY data_dt) AS lead1_date,
                lead(data_dt, 2, "1970/7/1") OVER (PARTITION BY user_id ORDER BY data_dt) AS lead2_date
            FROM (
                SELECT
                    user_id,
                    data_dt,
                    sum(low_carbon) low_carbon_sum
                FROM user_low_carbon
                WHERE substring(data_dt, 1, 4) = '2020'
                GROUP BY user_id, data_dt
                HAVING low_carbon_sum > 100
            ) t1
        ) t2
    ) t3
    WHERE (lag2 = 2 AND lag1 = 1)
       OR (lag1 = 1 AND lead1 = -1)
       OR (lead1 = -1 AND lead2 = -2)
) t4
  ON t4.user_id = t5.user_id
 AND t4.data_dt = t5.data_dt;
```
7. Results

```
...
OK
u_002  2020/1/2  150
u_002  2020/1/2  70
u_002  2020/1/3  30
u_002  2020/1/3  80
u_002  2020/1/4  150
u_002  2020/1/5  101
u_005  2020/1/2  50
u_005  2020/1/2  80
u_005  2020/1/3  180
u_005  2020/1/4  180
u_005  2020/1/4  10
u_008  2020/1/4  260
u_008  2020/1/5  360
u_008  2020/1/6  160
u_008  2020/1/7  60
u_008  2020/1/7  60
u_009  2020/1/2  70
u_009  2020/1/2  70
u_009  2020/1/3  170
u_009  2020/1/4  270
u_010  2020/1/4  90
u_010  2020/1/4  80
u_010  2020/1/5  90
u_010  2020/1/5  90
u_010  2020/1/6  190
u_010  2020/1/7  90
u_010  2020/1/7  90
u_011  2020/1/1  110
u_011  2020/1/2  100
u_011  2020/1/2  100
u_011  2020/1/3  120
u_013  2020/1/2  150
u_013  2020/1/2  50
u_013  2020/1/3  150
u_013  2020/1/4  550
u_013  2020/1/5  350
u_014  2020/1/5  250
u_014  2020/1/6  120
u_014  2020/1/7  270
u_014  2020/1/7  20
...
```
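As a side note, the "consecutive days" detection can also be done with the classic gap-and-islands trick: subtract each row's rank from its day number, so that all days in a consecutive run share one group key. The following is a sketch of this alternative, not part of the staged solution above; it returns the same (user_id, data_dt) pairs as t4, which you would then join back to user_low_carbon:

```sql
-- Alternative sketch: day number minus row_number() is constant
-- within a run of consecutive days.
select user_id, data_dt
from (
    select user_id, data_dt,
           count(*) over (partition by user_id, grp) as run_len
    from (
        select user_id, data_dt,
               -- a numeric day count avoids string-ordering pitfalls with
               -- non-zero-padded dates such as 2020/1/10
               datediff(regexp_replace(data_dt, "/", "-"), '1970-01-01')
                 - row_number() over (partition by user_id
                                      order by datediff(regexp_replace(data_dt, "/", "-"), '1970-01-01')) as grp
        from (
            select user_id, data_dt, sum(low_carbon) as low_carbon_sum
            from user_low_carbon
            where substring(data_dt, 1, 4) = '2020'
            group by user_id, data_dt
            having low_carbon_sum > 100
        ) daily
    ) keyed
) counted
where run_len >= 3;
```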
Recommended reading:
Hive-DML (Data Manipulation Language)
Hive-DDL (Data Definition Language)