Ant Forest Background and Data Preparation: Hands-On HiveQL Analysis of Ant Forest Plant Redemption
table_name: user_low_carbon
| user_id | data_dt | low_carbon |
| user ID | date | carbon reduction (g) |
The Ant Forest plant redemption table records how much carbon reduction a user needs in order to claim an eco-friendly plant.
table_name: plant_carbon
| plant_id | plant_name | low_carbon |
| plant ID | plant name | carbon required to redeem the plant (g) |
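To reproduce the walkthrough locally, here is a minimal DDL sketch for the two tables; the article does not show its own DDL, so the storage format and tab delimiter below are assumptions.

create table user_low_carbon(
    user_id    string,  -- user ID
    data_dt    string,  -- date kept as text, e.g. 2020/1/2
    low_carbon int      -- carbon reduction in grams
)
row format delimited fields terminated by '\t';

create table plant_carbon(
    plant_id   string,  -- plant ID
    plant_name string,  -- plant name
    low_carbon int      -- grams of carbon reduction needed to redeem the plant
)
row format delimited fields terminated by '\t';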
Problem: from the user_low_carbon table, query the daily flow records of
users who, in 2020, reduced their carbon emissions (low_carbon) by more than 100g
on each of three or more consecutive days.
Return the matching flow records from the user_low_carbon table.
For example, user u_002 qualifies with the records below, because 2020/1/2 through 2020/1/5 are four consecutive days and the daily carbon-reduction total exceeds 100g on each of them:
Jan 2: 150 + 70 = 220
Jan 3: 30 + 80 = 110
Jan 4: 150
Jan 5: 101
seq(key) user_id data_dt low_carbon
xxxxx10 u_002 2020/1/2 150
xxxxx11 u_002 2020/1/2 70
xxxxx12 u_002 2020/1/3 30
xxxxx13 u_002 2020/1/3 80
xxxxx14 u_002 2020/1/4 150
xxxxx15 u_002 2020/1/5 101
Solution 2:
SQL walkthrough (the staging tables exist only so the intermediate data can be inspected; you can use temporary tables instead, which Hive drops automatically when the session closes):
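As a sketch of that temporary-table variant, step 1's staging table could be declared as follows (CREATE TEMPORARY TABLE requires Hive 0.14+); the rest of the walkthrough keeps ordinary tables so the intermediate data survives the session.

create temporary table t1(user_id string, data_dt string, low_carbon_sum int, rn int);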
1. Aggregate on user_id and data_dt to compute each user's daily carbon reduction for 2020, keep only the days above 100g, and tag each row with a sequence number (numbering each user's days in date order). row_number() is evaluated after the GROUP BY, so it numbers one row per user per day; note that ordering by the raw data_dt string works here only because every sample date has a single-digit day, and general data should order by the normalized date instead.
create table t1(user_id string, data_dt string, low_carbon_sum int, rn int);

insert overwrite table t1
select
    user_id,
    data_dt,
    sum(low_carbon) as low_carbon_sum,
    row_number() over (partition by user_id order by data_dt) as rn
from user_low_carbon
where substring(data_dt, 1, 4) = '2020'
group by user_id, data_dt
having low_carbon_sum > 100; -- t1
hive> select * from t1;
OK
u_001 2020/1/2 270 1
u_001 2020/1/6 135 2
u_002 2020/1/2 220 1
u_002 2020/1/3 110 2
u_002 2020/1/4 150 3
u_002 2020/1/5 101 4
...
2. For each row, subtract the row number from the date: rows on consecutive days collapse to the same anchor date.
create table t2(user_id string, data_dt string, diff_date string);

insert overwrite table t2
select
    user_id,
    data_dt,
    date_sub(to_date(regexp_replace(data_dt, "/", "-")), rn) as diff_date
from t1; -- t2
hive> select * from t2;
OK
u_001 2020/1/2 2020-01-01
u_001 2020/1/6 2020-01-04
u_002 2020/1/2 2020-01-01
u_002 2020/1/3 2020-01-01
u_002 2020/1/4 2020-01-01
u_002 2020/1/5 2020-01-01
...
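To see why the trick works, the FROM-less one-off query below (Hive 0.13+) replays u_002's four rows from the t2 output above: consecutive dates minus an increasing row number all land on the same anchor date.

select
    date_sub('2020-01-02', 1),  -- 2020-01-01
    date_sub('2020-01-03', 2),  -- 2020-01-01
    date_sub('2020-01-04', 3),  -- 2020-01-01
    date_sub('2020-01-05', 4);  -- 2020-01-01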
3. Count, per user, how many rows share the same anchor-date difference, attaching the count to every row:
create table t3(user_id string, data_dt string, diff_date_count int);

insert overwrite table t3
select
    user_id,
    data_dt,
    count(*) over (partition by user_id, diff_date) as diff_date_count
from t2; -- t3
hive> select * from t3;
OK
u_001 2020/1/2 1
u_001 2020/1/6 1
u_002 2020/1/2 4
u_002 2020/1/3 4
u_002 2020/1/4 4
u_002 2020/1/5 4
...
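If the window form is hard to read, the island sizes can also be checked with a plain GROUP BY. This sketch is for inspection only: it returns one row per (user, anchor date) rather than one row per original record, and step 4 needs the per-record form above.

select
    user_id,
    diff_date,
    count(*) as diff_date_count
from t2
group by user_id, diff_date;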
4. Filter for rows whose shared-difference count is 3 or more:
create table t4(user_id string, data_dt string);

insert overwrite table t4
select
    user_id,
    data_dt
from t3
where diff_date_count >= 3; -- t4
hive> select * from t4;
OK
u_002 2020/1/2
u_002 2020/1/3
u_002 2020/1/4
u_002 2020/1/5
u_005 2020/1/3
u_005 2020/1/2
u_005 2020/1/4
...
5. Join back to the original table to produce the final flow records:
select
    t5.user_id,
    t5.data_dt,
    t5.low_carbon
from user_low_carbon t5
join t4
  on t4.user_id = t5.user_id
 and t4.data_dt = t5.data_dt
order by t5.user_id, t5.data_dt;
6. The complete query as a single statement:
SELECT
    t5.user_id,
    t5.data_dt,
    t5.low_carbon
FROM user_low_carbon t5
JOIN (
    SELECT
        user_id,
        data_dt
    FROM (
        SELECT
            user_id,
            data_dt,
            count(*) OVER (PARTITION BY user_id, diff_date) AS diff_date_count
        FROM (
            SELECT
                user_id,
                data_dt,
                date_sub(to_date(regexp_replace(data_dt, "/", "-")), rn) AS diff_date
            FROM (
                SELECT
                    user_id,
                    data_dt,
                    sum(low_carbon) AS low_carbon_sum,
                    row_number() OVER (PARTITION BY user_id ORDER BY data_dt) AS rn
                FROM user_low_carbon
                WHERE substring(data_dt, 1, 4) = '2020'
                GROUP BY user_id, data_dt
                HAVING low_carbon_sum > 100
            ) t1
        ) t2
    ) t3
    WHERE diff_date_count >= 3
) t4
  ON t4.user_id = t5.user_id
 AND t4.data_dt = t5.data_dt
ORDER BY t5.user_id, t5.data_dt;
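The nesting can also be flattened with CTEs (WITH clauses, supported since Hive 0.13). This variant is a sketch that mirrors steps 1-4 one-to-one, not the statement that produced the output below.

with t1 as (
    select
        user_id,
        data_dt,
        sum(low_carbon) as low_carbon_sum,
        row_number() over (partition by user_id order by data_dt) as rn
    from user_low_carbon
    where substring(data_dt, 1, 4) = '2020'
    group by user_id, data_dt
    having low_carbon_sum > 100
),
t2 as (
    select
        user_id,
        data_dt,
        date_sub(to_date(regexp_replace(data_dt, "/", "-")), rn) as diff_date
    from t1
),
t3 as (
    select
        user_id,
        data_dt,
        count(*) over (partition by user_id, diff_date) as diff_date_count
    from t2
)
select
    t5.user_id,
    t5.data_dt,
    t5.low_carbon
from user_low_carbon t5
join (select user_id, data_dt from t3 where diff_date_count >= 3) t4
  on t4.user_id = t5.user_id
 and t4.data_dt = t5.data_dt
order by t5.user_id, t5.data_dt;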
7. Result:
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12976 HDFS Write: 974 SUCCESS
Stage-Stage-2: HDFS Read: 12976 HDFS Write: 974 SUCCESS
Stage-Stage-4: HDFS Read: 12976 HDFS Write: 974 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u_002 2020/1/2 150
u_002 2020/1/2 70
u_002 2020/1/3 30
u_002 2020/1/3 80
u_002 2020/1/4 150
u_002 2020/1/5 101
u_005 2020/1/2 50
u_005 2020/1/2 80
u_005 2020/1/3 180
u_005 2020/1/4 180
u_005 2020/1/4 10
u_008 2020/1/4 260
u_008 2020/1/5 360
u_008 2020/1/6 160
u_008 2020/1/7 60
u_008 2020/1/7 60
u_009 2020/1/2 70
u_009 2020/1/2 70
u_009 2020/1/3 170
u_009 2020/1/4 270
u_010 2020/1/4 90
u_010 2020/1/4 80
u_010 2020/1/5 90
u_010 2020/1/5 90
u_010 2020/1/6 190
u_010 2020/1/7 90
u_010 2020/1/7 90
u_011 2020/1/1 110
u_011 2020/1/2 100
u_011 2020/1/2 100
u_011 2020/1/3 120
u_013 2020/1/2 150
u_013 2020/1/2 50
u_013 2020/1/3 150
u_013 2020/1/4 550
u_013 2020/1/5 350
u_014 2020/1/5 250
u_014 2020/1/6 120
u_014 2020/1/7 270
u_014 2020/1/7 20
Recommended reading:
Hive-DML (Data Manipulation Language)
Hive-DDL (Data Definition Language)