关注+星标,干货第一时间送达

上次写:
还是8月下旬,
那时候就在感慨时光如梭了。
如今,中秋节已然到来。
春节,还远吗?
不过回头一看,这大半年,好像什么也没干
不知诸位,是否也曾有这样的无奈。
不过,惆怅归惆怅,日子还是要一天天好好过的,
毫无疑问,题,还是歹坚持做的。
因为小编总是觉得,韧劲,才是学习进步的关键。
不管是数据结构、算法、sql、机器学习,
这些只能慢慢打磨自己的skill。
当然,小编相信,你们之中肯定有"天纵之才"
无师自通、过目不忘。
所以,大佬们,求赞
【温馨提示:建表语句及数据导入脚本,已经放在文章末尾】
一、需求场景
题目:
从用户登录明细表(user_login_detail)中统计有新增用户的日期的新增用户数(若某日未新增用户,则不出现在统计结果中),并统计这些新增用户的一日留存率。
用户首次登录为当天新增,次日也登录则为一日留存。
一日留存用户占新增用户数的比率为一日留存率。
用户登录明细表user_login_detail:

正确结果:

二、解决方案
演示工具:Hive-3.1.3 + DataGrip2022
每日新增用户数量,
是比较容易计算的,
只需要筛选出每个用户,最早的登录日期即可。
留存率,则需要关注,用户第二次登录日期,
是否和最早登录日期,相差1天。
如果相差1天,则属于留存。
明白这两点,就好着手了。
(1)筛选每个用户,最早登录日期,作为用户新增日期
select user_id,
date(min(login_ts)) as dt
from user_login_detail
group by user_id
结果集:

代码里面有:date() 函数,
是将时间中的日期,提前出来,
比如:

用 date() 提取日期:

其实和:date_format(value, 'yyyy-MM-dd') 功能一样:

日期函数有很多,要掌握些常用的:
date_format(), date_sub(), date_add(), datediff();
(2)通过left join,获取每个用户新增日期,相差一天的日期,从而筛选出留存日期
select t1.dt,
count(distinct t1.user_id) as acc,
count(distinct t2.user_id) as retent
from (
select user_id,
date(min(login_ts)) as dt
from user_login_detail
group by user_id
)t1
left join user_login_detail t2
on t1.user_id = t2.user_id
and datediff(date(t2.login_ts), dt)=1
group by t1.dt
结果集:

如果先不计算count,直接看left join 的结果,
代码是:
select *
from (
select user_id,
date(min(login_ts)) as dt
from user_login_detail
group by user_id
)t1
left join user_login_detail t2
on t1.user_id = t2.user_id
and datediff(date(t2.login_ts), dt)=1
结果集:

符合次日同一用户再次登录的,
就是106用户,在2021-10-04注册,2021-10-05再次登录了。
这样可以理解上面代码中的计算过程了吧
(3)留存率计算,通过cast,完成数据格式转换
select t3.dt,
t3.acc,
cast(cast(t3.retent as decimal(16,2)) / t3.acc as decimal(16,2)) as rate
from (
select t1.dt,
count(distinct t1.user_id) as acc,
count(distinct t2.user_id) as retent
from (
select user_id,
date(min(login_ts)) as dt
from user_login_detail
group by user_id
)t1
left join user_login_detail t2
on t1.user_id = t2.user_id
and datediff(date(t2.login_ts), dt)=1
group by t1.dt
)t3
结果集:

三、源数据
用户登录明细表user_login_detail:
create table user_login_detail
(
user_id string,
ip_address string,
login_ts string,
logout_ts string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ('field.delim' = '\t') stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://mycluster/user/hive/warehouse/hql_test2.db/user_login_detail'
tblproperties ('bucketing_version' = '2');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('101', '180.149.130.161', '2021-09-21 08:00:00', '2021-09-27 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('101', '180.149.130.161', '2021-09-27 08:00:00', '2021-09-27 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('101', '180.149.130.161', '2021-09-28 09:00:00', '2021-09-28 09:10:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('101', '180.149.130.161', '2021-09-29 13:30:00', '2021-09-29 13:50:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('101', '180.149.130.161', '2021-09-30 20:00:00', '2021-09-30 20:10:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('102', '120.245.11.2', '2021-09-22 09:00:00', '2021-09-27 09:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('102', '120.245.11.2', '2021-10-01 08:00:00', '2021-10-01 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('102', '180.149.130.174', '2021-10-01 07:50:00', '2021-10-01 08:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('102', '120.245.11.2', '2021-10-02 08:00:00', '2021-10-02 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('103', '27.184.97.3', '2021-09-23 10:00:00', '2021-09-27 10:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('103', '27.184.97.3', '2021-10-03 07:50:00', '2021-10-03 09:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('104', '27.184.97.34', '2021-09-24 11:00:00', '2021-09-27 11:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('104', '27.184.97.34', '2021-10-03 07:50:00', '2021-10-03 08:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('104', '27.184.97.34', '2021-10-03 08:50:00', '2021-10-03 10:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('104', '120.245.11.89', '2021-10-03 08:40:00', '2021-10-03 10:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('105', '119.180.192.212', '2021-10-04 09:10:00', '2021-10-04 09:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('106', '119.180.192.66', '2021-10-04 08:40:00', '2021-10-04 10:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('106', '119.180.192.66', '2021-10-05 21:50:00', '2021-10-05 22:40:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('107', '219.134.104.7', '2021-09-25 12:00:00', '2021-09-27 12:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('107', '219.134.104.7', '2021-10-05 22:00:00', '2021-10-05 23:00:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('107', '219.134.104.7', '2021-10-06 09:10:00', '2021-10-06 10:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('107', '27.184.97.46', '2021-10-06 09:00:00', '2021-10-06 10:00:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('108', '101.227.131.22', '2021-10-06 09:00:00', '2021-10-06 10:00:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('108', '101.227.131.22', '2021-10-06 22:00:00', '2021-10-06 23:00:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('109', '101.227.131.29', '2021-09-26 13:00:00', '2021-09-27 13:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('109', '101.227.131.29', '2021-10-06 08:50:00', '2021-10-06 10:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('109', '101.227.131.29', '2021-10-08 09:00:00', '2021-10-08 09:10:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('1010', '119.180.192.10', '2021-09-27 14:00:00', '2021-09-27 14:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('1010', '119.180.192.10', '2021-10-09 08:50:00', '2021-10-09 10:20:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('1011', '180.149.130.161', '2021-09-21 08:00:00', '2021-09-27 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('1011', '180.149.130.161', '2021-10-24 08:00:00', '2021-10-29 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('1011', '180.149.130.161', '2022-09-21 08:00:00', '2022-09-27 08:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('105', '119.180.192.212', '2021-10-06 09:10:00', '2021-10-04 09:30:00');
INSERT INTO hql_test2.user_login_detail (user_id, ip_address, login_ts, logout_ts) VALUES ('106', '119.180.192.66', '2021-10-05 08:50:00', '2021-10-05 12:40:00');
OK,这就是本期的内容了,下期再见!




