
HiveQL in Practice: Ant Forest Low-Carbon User Ranking Analysis, Solution 1

大数据真有意思 2020-05-30

For the Ant Forest background and data preparation, see: HiveQL实战蚂蚁森林植物申领统计分析

table_name: user_low_carbon

user_id    data_dt    low_carbon
user       date       carbon emission reduced (g)


The Ant Forest plant-redemption table records the carbon reduction required to claim each eco-friendly plant.

table_name: plant_carbon

plant_id    plant_name    low_carbon
plant ID    plant name    carbon reduction required to redeem the plant


Problem: query the daily flow records in user_low_carbon for users who, on a run of three or more consecutive days in 2020, reduced carbon emissions (low_carbon) by more than 100 g in total on each day.

Return the rows of user_low_carbon that satisfy these conditions.


For example, user u_002's qualifying records are shown below: 2020/1/2 through 2020/1/5 form four consecutive days, and on each of those days the total carbon reduction exceeds 100 g:

Jan 2: 150 + 70 = 220

Jan 3: 30 + 80 = 110

Jan 4: 150

Jan 5: 101


seq(key) user_id data_dt  low_carbon

xxxxx10    u_002  2020/1/2  150

xxxxx11    u_002  2020/1/2  70

xxxxx12    u_002  2020/1/3  30

xxxxx13    u_002  2020/1/3  80

xxxxx14    u_002  2020/1/4  150

xxxxx14    u_002  2020/1/5  101
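
The per-day totals above can be sanity-checked in plain Python (the sample rows are copied from the table; this is only an illustration, not part of the Hive solution):

```python
# Sum low_carbon per (user, day) and confirm every daily total exceeds 100 g.
from collections import defaultdict

records = [
    ("u_002", "2020/1/2", 150),
    ("u_002", "2020/1/2", 70),
    ("u_002", "2020/1/3", 30),
    ("u_002", "2020/1/3", 80),
    ("u_002", "2020/1/4", 150),
    ("u_002", "2020/1/5", 101),
]

daily = defaultdict(int)
for user, day, grams in records:
    daily[(user, day)] += grams

print(daily[("u_002", "2020/1/2")])              # 220
print(all(total > 100 for total in daily.values()))  # True
```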


Solution 1:


SQL workflow [the intermediate tables below are created only so the intermediate data can be inspected; temporary tables work just as well and are dropped automatically when the session closes]:


1. Aggregate by user ID and date to compute each user's daily carbon reduction (for 2020)

    create table t1(user_id String, data_dt String, low_carbon_sum int);
    insert overwrite table t1
    select
    user_id,
    data_dt,
    sum(low_carbon) low_carbon_sum
    from user_low_carbon
    where substring(data_dt, 1, 4) = '2020'
    group by user_id, data_dt
    having low_carbon_sum > 100; -- t1
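
As a rough Python equivalent of this GROUP BY + HAVING step (the sample rows here are made up for illustration):

```python
# Mirror step 1: filter to 2020, sum low_carbon per (user, day),
# then keep only days whose total exceeds 100 g.
from collections import defaultdict

rows = [
    ("u_001", "2019/5/1", 300),  # dropped: not in 2020
    ("u_002", "2020/1/2", 150),
    ("u_002", "2020/1/2", 70),
    ("u_002", "2020/1/6", 90),   # dropped later: daily total <= 100
]

daily = defaultdict(int)
for user, day, grams in rows:
    if day[:4] == "2020":        # substring(data_dt, 1, 4) = '2020'
        daily[(user, day)] += grams

t1 = [(u, d, s) for (u, d), s in daily.items() if s > 100]  # HAVING low_carbon_sum > 100
print(t1)  # [('u_002', '2020/1/2', 220)]
```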



    2. Place the dates of the two preceding and two following rows (per user, ordered by date) onto the same row, with a default value of 1970/7/1

      create table t2(user_id String, data_dt String, lag2_date String, lag1_date String, lead1_date String, lead2_date String);
      insert overwrite table t2
      select
      user_id,
      data_dt,
      lag(data_dt, 2, "1970/7/1") over(partition by user_id order by data_dt) as lag2_date,
      lag(data_dt, 1, "1970/7/1") over(partition by user_id order by data_dt) as lag1_date,
      lead(data_dt, 1, "1970/7/1") over(partition by user_id order by data_dt) as lead1_date,
      lead(data_dt, 2, "1970/7/1") over(partition by user_id order by data_dt) as lead2_date
      from t1; -- t2
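
The lag/lead windowing can be sketched in plain Python, assuming rows are sorted by date within each user's partition (toy t1 data for illustration):

```python
# Mirror lag(n)/lead(n) over (partition by user_id order by data_dt),
# using "1970/7/1" as the out-of-range default, as in the query.
from collections import defaultdict

t1 = [("u_002", "2020/1/2"), ("u_002", "2020/1/3"),
      ("u_002", "2020/1/4"), ("u_001", "2020/1/2")]

DEFAULT = "1970/7/1"
by_user = defaultdict(list)
for user, day in t1:
    by_user[user].append(day)

t2 = []
for user, days in by_user.items():
    days.sort()  # ORDER BY data_dt within the partition
    for i, day in enumerate(days):
        def nth(j):  # lag(n) is nth(i - n); lead(n) is nth(i + n)
            return days[j] if 0 <= j < len(days) else DEFAULT
        t2.append((user, day, nth(i - 2), nth(i - 1), nth(i + 1), nth(i + 2)))

for row in t2:
    print(row)
```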



        hive> select * from t2;
        OK
        u_001 2020/1/2 1970/7/1 1970/7/1 2020/1/6 1970/7/1
        u_001 2020/1/6 1970/7/1 2020/1/2 1970/7/1 1970/7/1
        u_002 2020/1/2 1970/7/1 1970/7/1 2020/1/3 2020/1/4
        u_002 2020/1/3 1970/7/1 2020/1/2 2020/1/4 2020/1/5
        u_002 2020/1/4 2020/1/2 2020/1/3 2020/1/5 1970/7/1
        u_002 2020/1/5 2020/1/3 2020/1/4 1970/7/1 1970/7/1
        ...



        3. Compute the difference in days between each row's date and the dates of the two rows before and after it

          create table t3(user_id String, data_dt String, lag2 int, lag1 int, lead1 int, lead2 int);
          insert overwrite table t3
          select
          user_id,
          data_dt,
          datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag2_date, "/", "-")) lag2,
          datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag1_date, "/", "-")) lag1,
          datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead1_date, "/", "-")) lead1,
          datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead2_date, "/", "-")) lead2
          from t2; -- t3
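
Hive's datediff on the reformatted date strings behaves like this minimal Python sketch (assuming the "yyyy/M/d" format used throughout the article):

```python
# Mirror datediff(regexp_replace(a, "/", "-"), regexp_replace(b, "/", "-")).
from datetime import date

def to_date(s):
    """Parse a 'yyyy/M/d' string into a date."""
    y, m, d = s.split("/")
    return date(int(y), int(m), int(d))

def datediff(a, b):
    """Days between two date strings, like Hive's datediff(a, b)."""
    return (to_date(a) - to_date(b)).days

print(datediff("2020/1/4", "2020/1/2"))  # 2
print(datediff("2020/1/2", "1970/7/1"))  # 18082
```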



            hive> select * from t3;
            OK
            u_001 2020/1/2 18082 18082 -4 18082
            u_001 2020/1/6 18086 4 18086 18086
            u_002 2020/1/2 18082 18082 -1 -2
            u_002 2020/1/3 18083 1 -1 -2
            u_002 2020/1/4 2 1 -1 18084
            u_002 2020/1/5 2 1 18085 18085
            ...


            4. Extract the qualifying (user_id, data_dt) pairs, i.e. days belonging to a run of 3 or more consecutive days [a run of 4+ days necessarily contains runs of 3]

            A qualifying day is the last, middle, or first day of some 3-day consecutive window

              create table t4(user_id String, data_dt String);
              insert overwrite table t4
              select
              user_id,
              data_dt
              from t3
              where (lag2 = 2 and lag1 = 1) or (lag1 = 1 and lead1 = -1) or (lead1 = -1 and lead2 = -2); -- t4
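
The three-branch filter can be checked in Python against t3 rows computed by hand from u_002's four consecutive qualifying days (the sample values below are derived from the example data, not copied from a Hive run):

```python
# Keep a day if it closes, centers, or opens a 3-day consecutive window.
t3 = [
    # (user_id, data_dt, lag2, lag1, lead1, lead2)
    ("u_002", "2020/1/2", 18082, 18082, -1, -2),    # opens a run
    ("u_002", "2020/1/3", 18083, 1, -1, -2),        # between two neighbors
    ("u_002", "2020/1/4", 2, 1, -1, 18084),         # matches on both sides
    ("u_002", "2020/1/5", 2, 1, 18085, 18085),      # closes a run
    ("u_001", "2020/1/2", 18082, 18082, -4, 18082), # isolated day: dropped
]

t4 = [(u, d) for u, d, lag2, lag1, lead1, lead2 in t3
      if (lag2 == 2 and lag1 == 1)        # the two preceding days are consecutive
      or (lag1 == 1 and lead1 == -1)      # sits between two adjacent days
      or (lead1 == -1 and lead2 == -2)]   # the two following days are consecutive
print(t4)
```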


                hive> select * from t4;
                OK
                u_002 2020/1/2
                u_002 2020/1/3
                u_002 2020/1/4
                u_002 2020/1/5
                u_005 2020/1/2
                u_005 2020/1/3
                u_005 2020/1/4
                u_008 2020/1/4
                u_008 2020/1/5
                u_008 2020/1/6
                u_008 2020/1/7
                u_009 2020/1/2
                u_009 2020/1/3
                u_009 2020/1/4
                u_010 2020/1/4
                u_010 2020/1/5
                u_010 2020/1/6
                u_010 2020/1/7
                u_011 2020/1/1
                u_011 2020/1/2
                u_011 2020/1/3
                u_013 2020/1/2
                u_013 2020/1/3
                u_013 2020/1/4
                u_013 2020/1/5
                u_014 2020/1/5
                u_014 2020/1/6
                u_014 2020/1/7
                Fetched: 28 row(s)


                5. Join back to the original table to get the final flow records

                  select
                  t5.user_id,
                  t5.data_dt,
                  t5.low_carbon
                  from user_low_carbon t5
                  join t4 on t4.user_id = t5.user_id and t4.data_dt = t5.data_dt;
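
In Python terms, this join is just a membership filter on the qualifying (user, day) set (toy data for illustration):

```python
# Mirror the inner join: keep only raw flow rows whose (user, day)
# pair appears in the qualifying set t4.
user_low_carbon = [
    ("u_002", "2020/1/2", 150),
    ("u_002", "2020/1/2", 70),
    ("u_002", "2020/1/6", 90),  # not in t4: dropped by the join
]
t4 = {("u_002", "2020/1/2")}

result = [r for r in user_low_carbon if (r[0], r[1]) in t4]
print(result)  # [('u_002', '2020/1/2', 150), ('u_002', '2020/1/2', 70)]
```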


                  6. The full SQL

                    SELECT
                        t5.user_id,
                        t5.data_dt,
                        t5.low_carbon
                    FROM user_low_carbon t5
                    JOIN (
                        SELECT user_id, data_dt
                        FROM (
                            SELECT
                                user_id,
                                data_dt,
                                datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag2_date, "/", "-")) lag2,
                                datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lag1_date, "/", "-")) lag1,
                                datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead1_date, "/", "-")) lead1,
                                datediff(regexp_replace(data_dt, "/", "-"), regexp_replace(lead2_date, "/", "-")) lead2
                            FROM (
                                SELECT
                                    user_id,
                                    data_dt,
                                    lag(data_dt, 2, "1970/7/1") over(partition by user_id order by data_dt) AS lag2_date,
                                    lag(data_dt, 1, "1970/7/1") over(partition by user_id order by data_dt) AS lag1_date,
                                    lead(data_dt, 1, "1970/7/1") over(partition by user_id order by data_dt) AS lead1_date,
                                    lead(data_dt, 2, "1970/7/1") over(partition by user_id order by data_dt) AS lead2_date
                                FROM (
                                    SELECT
                                        user_id,
                                        data_dt,
                                        sum(low_carbon) low_carbon_sum
                                    FROM user_low_carbon
                                    WHERE substring(data_dt, 1, 4) = '2020'
                                    GROUP BY user_id, data_dt
                                    HAVING low_carbon_sum > 100
                                ) t1
                            ) t2
                        ) t3
                        WHERE (lag2 = 2 AND lag1 = 1)
                           OR (lag1 = 1 AND lead1 = -1)
                           OR (lead1 = -1 AND lead2 = -2)
                    ) t4 ON t4.user_id = t5.user_id AND t4.data_dt = t5.data_dt;


                    7. Results

                      ...
                      OK
                      u_002 2020/1/2 150
                      u_002 2020/1/2 70
                      u_002 2020/1/3 30
                      u_002 2020/1/3 80
                      u_002 2020/1/4 150
                      u_002 2020/1/5 101
                      u_005 2020/1/2 50
                      u_005 2020/1/2 80
                      u_005 2020/1/3 180
                      u_005 2020/1/4 180
                      u_005 2020/1/4 10
                      u_008 2020/1/4 260
                      u_008 2020/1/5 360
                      u_008 2020/1/6 160
                      u_008 2020/1/7 60
                      u_008 2020/1/7 60
                      u_009 2020/1/2 70
                      u_009 2020/1/2 70
                      u_009 2020/1/3 170
                      u_009 2020/1/4 270
                      u_010 2020/1/4 90
                      u_010 2020/1/4 80
                      u_010 2020/1/5 90
                      u_010 2020/1/5 90
                      u_010 2020/1/6 190
                      u_010 2020/1/7 90
                      u_010 2020/1/7 90
                      u_011 2020/1/1 110
                      u_011 2020/1/2 100
                      u_011 2020/1/2 100
                      u_011 2020/1/3 120
                      u_013 2020/1/2 150
                      u_013 2020/1/2 50
                      u_013 2020/1/3 150
                      u_013 2020/1/4 550
                      u_013 2020/1/5 350
                      u_014 2020/1/5 250
                      u_014 2020/1/6 120
                      u_014 2020/1/7 270
                      u_014 2020/1/7 20
                      ...

                      Recommended reading:

                      HiveQL实战蚂蚁森林植物申领统计分析

                      深入理解Hive数据存储格式和压缩要点

                      Hive-函数

                      Hive-查询

                      Hive-DML(Data Manipulation Language)数据操作语言

                      Hive-DDL(Data Definition Language)数据定义

                      Hive优化(整理版)

                      Spark Core之Shuffle解析

                      数据仓库开发规范

