



Given a user login table (user_active_log) with two fields, userId (user ID) and createdTime (login time), compute the number of retained users and the retention rate for days 1, 2, 3, 5, 7, and 30.
Analysis and solution:
To compute the 1-, 2-, 3-, 5-, 7-, and 30-day retained user counts and retention rates, the key quantity is the time distance (number of retained days). The idea is to subtract each user's first login date from each of that user's login dates.
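As a quick illustration of that date difference, here is a minimal Python sketch for a single user (the `dates` list is an illustrative example, not data from the real table):

```python
from datetime import date

# One user's deduplicated login dates (illustrative values)
dates = ["2022-06-01", "2022-06-03", "2022-06-05"]

first = min(dates)  # the user's first login date (first_time)
# Python equivalent of Hive's datediff(createdTime, first_time)
keep_times = [(date.fromisoformat(d) - date.fromisoformat(first)).days
              for d in dates]
print(keep_times)  # → [0, 2, 4]
```

A keep_time of N means the user logged in again N days after their first login, which is exactly the quantity the queries below aggregate.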
1. Prepare the dataset
A sample of the test data in the user login log file user_active_log.txt is shown below.
1,2022-06-01
1,2022-06-02
1,2022-06-03
1,2022-06-04
1,2022-06-05
1,2022-06-06
1,2022-06-07
1,2022-06-08
1,2022-06-09
2,2022-06-01
2,2022-06-02
2,2022-06-03
2,2022-06-04
2,2022-06-05
3,2022-06-01
3,2022-06-02
4,2022-06-01
4,2022-06-02
4,2022-06-03
4,2022-06-04
4,2022-06-05
4,2022-06-06
4,2022-06-07
4,2022-06-08
5,2022-06-01
5,2022-06-02
5,2022-06-03
5,2022-06-04
6,2022-06-01
6,2022-06-02
6,2022-06-03
In this log format, the first column (userId) is the user ID and the second column (createdTime) is the login date.
2. Create the Hive table
In the Hive CLI, create the user_active_log table to store the user login log.
DROP TABLE IF EXISTS `user_active_log`;
CREATE TABLE `user_active_log` (
  userId int,
  createdTime string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
3. Load the test dataset
In the Hive CLI, run the following command to load the login log file user_active_log.txt into the user_active_log table.
load data local inpath '/home/hadoop/shell/data/user_active_log.txt' overwrite into table user_active_log;
4. Deduplicate by user ID and login date
select userId, createdTime
from user_active_log
group by userId, createdTime;
5. Add a first_time column storing each userId's earliest login date
select userId, createdTime,
       first_value(createdTime) over (partition by userId order by createdTime) first_time
from (
  select userId, createdTime
  from user_active_log
  group by userId, createdTime
) t0;
6. Add a keep_time column: the login date minus the earliest login date first_time gives the number of retained days
select userId, createdTime,
       first_value(createdTime) over (partition by userId order by createdTime) first_time,
       datediff(createdTime, first_value(createdTime) over (partition by userId order by createdTime)) keep_time
from (
  select userId, createdTime
  from user_active_log
  group by userId, createdTime
) t0;
7. For each first login date, count the occurrences of each keep_time value; that count is the N-day retention for that date
select t1.first_time,
       sum(case when t1.keep_time = 1 then 1 else 0 end) day1,
       sum(case when t1.keep_time = 2 then 1 else 0 end) day2,
       sum(case when t1.keep_time = 3 then 1 else 0 end) day3,
       sum(case when t1.keep_time = 5 then 1 else 0 end) day5,
       sum(case when t1.keep_time = 7 then 1 else 0 end) day7,
       sum(case when t1.keep_time = 30 then 1 else 0 end) day30
from (
  select userId, createdTime,
         first_value(createdTime) over (partition by userId order by createdTime) first_time,
         datediff(createdTime, first_value(createdTime) over (partition by userId order by createdTime)) keep_time
  from (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
  ) t0
) t1
group by t1.first_time
order by t1.first_time;
Dividing each count by the size of the cohort (the number of distinct users sharing that first_time) gives the retention rate.
select t1.first_time,
       sum(case when t1.keep_time = 1 then 1 else 0 end) / count(distinct t1.userId) day1,
       sum(case when t1.keep_time = 2 then 1 else 0 end) / count(distinct t1.userId) day2,
       sum(case when t1.keep_time = 3 then 1 else 0 end) / count(distinct t1.userId) day3,
       sum(case when t1.keep_time = 5 then 1 else 0 end) / count(distinct t1.userId) day5,
       sum(case when t1.keep_time = 7 then 1 else 0 end) / count(distinct t1.userId) day7,
       sum(case when t1.keep_time = 30 then 1 else 0 end) / count(distinct t1.userId) day30
from (
  select userId, createdTime,
         first_value(createdTime) over (partition by userId order by createdTime) first_time,
         datediff(createdTime, first_value(createdTime) over (partition by userId order by createdTime)) keep_time
  from (
    select userId, createdTime
    from user_active_log
    group by userId, createdTime
  ) t0
) t1
group by t1.first_time
order by t1.first_time;
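To sanity-check the queries, the same computation can be sketched in Python against the sample dataset above (a sketch assuming the logins are already deduplicated; function and variable names are illustrative, not part of the Hive solution):

```python
from datetime import date
from collections import defaultdict

# Deduplicated sample data from user_active_log.txt: userId -> login dates
logins = {
    1: [f"2022-06-{d:02d}" for d in range(1, 10)],
    2: [f"2022-06-{d:02d}" for d in range(1, 6)],
    3: ["2022-06-01", "2022-06-02"],
    4: [f"2022-06-{d:02d}" for d in range(1, 9)],
    5: [f"2022-06-{d:02d}" for d in range(1, 5)],
    6: [f"2022-06-{d:02d}" for d in range(1, 4)],
}

DAYS = (1, 2, 3, 5, 7, 30)

def retention_by_cohort(logins):
    """Group users by first login date and count keep_time hits per cohort."""
    counts = defaultdict(lambda: {n: 0 for n in DAYS})  # first_time -> {N: retained}
    sizes = defaultdict(int)                            # first_time -> cohort size
    for uid, dates in logins.items():
        first = min(dates)          # first_value(createdTime) per userId
        sizes[first] += 1           # count(distinct userId) per first_time
        for d in set(dates):
            # datediff(createdTime, first_time)
            keep = (date.fromisoformat(d) - date.fromisoformat(first)).days
            if keep in DAYS:
                counts[first][keep] += 1
    rates = {ft: {n: c[n] / sizes[ft] for n in DAYS} for ft, c in counts.items()}
    return counts, sizes, rates

counts, sizes, rates = retention_by_cohort(logins)
# All six sample users first logged in on 2022-06-01, so there is one cohort.
print(dict(counts["2022-06-01"]))  # → {1: 6, 2: 5, 3: 4, 5: 2, 7: 2, 30: 0}
```

The printed counts match what the first query in step 7 should return for first_time = 2022-06-01, and dividing by the cohort size of 6 reproduces the rates from the second query.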
End





