
Data Warehouse Interview: How to Count Retained Registered Users and the Retention Rate

大数据研习社 2022-07-08




01
Problem Description:

There is a user login table (user_active_log) with two fields: userId (the user ID) and createdTime (the login timestamp). Compute the number of retained users and the retention rate for day 1, 2, 3, 5, 7, and 30.


Analysis and Solution:

To count day-1/2/3/5/7/30 retained users and retention rates, the key quantity is the retention interval in days: for each login, subtract the user's first login date from that login's date.
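Before writing any SQL, the core calculation can be sketched in a few lines of Python. This is a hypothetical stand-alone cross-check, not part of the Hive solution; the `logins` sample here is made up for illustration:

```python
from datetime import date

# Sample (userId, login date) pairs; in the real task these come from user_active_log.
logins = [(1, date(2022, 6, 1)), (1, date(2022, 6, 3)),
          (2, date(2022, 6, 2)), (2, date(2022, 6, 2))]

# Earliest login date per user, the analogue of first_value() over a userId window.
first_login = {}
for uid, d in sorted(set(logins)):
    first_login.setdefault(uid, d)

# keep_time = login date minus first login date, the analogue of datediff().
keep_time = [(uid, d, (d - first_login[uid]).days) for uid, d in sorted(set(logins))]
print(keep_time)  # user 1's login on 2022-06-03 has keep_time 2
```

The SQL steps below perform exactly these three operations: deduplicate, find the first login per user, and subtract.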


1. Prepare the dataset

A portion of the test dataset in the user login log file user_active_log.txt is shown below.

1,2022-06-01

1,2022-06-02

1,2022-06-03

1,2022-06-04

1,2022-06-05

1,2022-06-06

1,2022-06-07

1,2022-06-08

1,2022-06-09

2,2022-06-01

2,2022-06-02

2,2022-06-03

2,2022-06-04

2,2022-06-05

3,2022-06-01

3,2022-06-02

4,2022-06-01

4,2022-06-02

4,2022-06-03

4,2022-06-04

4,2022-06-05

4,2022-06-06

4,2022-06-07

4,2022-06-08

5,2022-06-01

5,2022-06-02

5,2022-06-03

5,2022-06-04

6,2022-06-01

6,2022-06-02

6,2022-06-03

In this log format, the first column (userId) is the user ID and the second column (createdTime) is the login date.


2. Create the Hive table

In the Hive CLI, create the user_active_log table to store the user login log.

    DROP TABLE IF EXISTS `user_active_log`;
    CREATE TABLE `user_active_log` (
    userId int,
    createdTime string
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE ;


3. Load the test dataset

In the Hive CLI, load the user login log file user_active_log.txt into the user_active_log table with the following command.

    load data local inpath '/home/hadoop/shell/data/user_active_log.txt' overwrite into table user_active_log;


4. Deduplicate by userId and login date

    select
    userId,
    createdTime
    from user_active_log
    group by userId, createdTime;


5. Add a first_time column holding each userId's earliest login date

    select
    userId,
    createdTime,
    first_value(createdTime) over(partition by userId order by createdTime) first_time
    from
    (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
    ) t0;


6. Add a keep_time column: the login date minus first_time, i.e. the number of retention days

    select
    userId,
    createdTime,
    first_value(createdTime) over(partition by userId order by createdTime) first_time,
    datediff(createdTime, first_value(createdTime) over(partition by userId order by createdTime)) keep_time
    from
    (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
    ) t0;


7. Group by first login date and count the rows at each keep_time value: this is that day's N-day retained user count

    select
    t1.first_time,
    sum(case when t1.keep_time = 1 then 1 else 0 end) day1,
    sum(case when t1.keep_time = 2 then 1 else 0 end) day2,
    sum(case when t1.keep_time = 3 then 1 else 0 end) day3,
    sum(case when t1.keep_time = 5 then 1 else 0 end) day5,
    sum(case when t1.keep_time = 7 then 1 else 0 end) day7,
    sum(case when t1.keep_time = 30 then 1 else 0 end) day30
    from
    (
    select
    userId,
    createdTime,
    first_value(createdTime) over(partition by userId order by createdTime) first_time,
    datediff(createdTime, first_value(createdTime) over(partition by userId order by createdTime)) keep_time
    from
    (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
    ) t0
    ) t1
    group by t1.first_time
    order by t1.first_time;
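The pivot above can be cross-checked in plain Python against the sample dataset. This is a hypothetical verification script, assuming the six users listed in step 1, each of whom logs in on consecutive days of June 2022 starting on the 1st:

```python
from collections import defaultdict

# Login days per user from the sample dataset (day of month, June 2022):
# user 1 logs in on days 1-9, user 2 on days 1-5, and so on.
logins = {1: range(1, 10), 2: range(1, 6), 3: range(1, 3),
          4: range(1, 9), 5: range(1, 5), 6: range(1, 4)}

counts = defaultdict(int)
for uid, days in logins.items():
    first = min(days)                 # first login day
    for d in days:
        counts[d - first] += 1        # keep_time -> number of retained users

retained = {n: counts.get(n, 0) for n in (1, 2, 3, 5, 7, 30)}
print(retained)  # {1: 6, 2: 5, 3: 4, 5: 2, 7: 2, 30: 0}
```

All six users return on day 1; only users 1 and 4 are still logging in on days 5 and 7, which matches what the Hive query should report for first_time = 2022-06-01.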


8. Divide each N-day retained count by the number of users who first logged in that day: this is the retention rate

    select
    t1.first_time,
    sum(case when t1.keep_time = 1 then 1 else 0 end) / count(distinct t1.userId) day1,
    sum(case when t1.keep_time = 2 then 1 else 0 end) / count(distinct t1.userId) day2,
    sum(case when t1.keep_time = 3 then 1 else 0 end) / count(distinct t1.userId) day3,
    sum(case when t1.keep_time = 5 then 1 else 0 end) / count(distinct t1.userId) day5,
    sum(case when t1.keep_time = 7 then 1 else 0 end) / count(distinct t1.userId) day7,
    sum(case when t1.keep_time = 30 then 1 else 0 end) / count(distinct t1.userId) day30
    from
    (
    select
    userId,
    createdTime,
    first_value(createdTime) over(partition by userId order by createdTime) first_time,
    datediff(createdTime, first_value(createdTime) over(partition by userId order by createdTime)) keep_time
    from
    (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
    ) t0
    ) t1
    group by t1.first_time
    order by t1.first_time;
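Note that count(distinct t1.userId) works as the cohort size because every row of a given user carries the same first_time, so the group for a first-login date contains exactly that cohort's users. As a hypothetical Python cross-check on the same sample data, where all six users share first_time 2022-06-01:

```python
# Login days per user from the sample dataset (day of month, June 2022).
logins = {1: range(1, 10), 2: range(1, 6), 3: range(1, 3),
          4: range(1, 9), 5: range(1, 5), 6: range(1, 4)}

cohort = len(logins)  # users whose first login is 2022-06-01, i.e. count(distinct userId)
rates = {}
for n in (1, 2, 3, 5, 7, 30):
    # A user counts as day-n retained if some login falls exactly n days after the first.
    retained = sum(1 for days in logins.values() if n in {d - min(days) for d in days})
    rates[n] = retained / cohort

print({n: round(r, 3) for n, r in rates.items()})
# {1: 1.0, 2: 0.833, 3: 0.667, 5: 0.333, 7: 0.333, 30: 0.0}
```

So day-1 retention is 100% for this cohort, and day-5 and day-7 retention are both 2/6, matching the counts from step 7 divided by the cohort size of 6.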

