用户特征也称为用户标签。是描述用户某一方面(某一个维度)表现的抽象描述。
一般常说的,某个用户或者某个产品的人群画像是一线城市,男性,高收入,爱运动。
这里,一线城市是对用户活动IP的抽象,男是性别或者经常购买商品的性别偏好,高收入是对其月收入/年收入的概括,爱运动是兴趣爱好。
可以看到每一个特征都是用户的一个维度上的表现。有些是基本信息,有些是根据其行为抽象出的信息。
无论我们是做数据挖掘,需要大量的行为特征,还是客户服务,需要了解客户的各维度信息,还是人群分析发现更多的可能,种种都离不开用户特征。
而获取这些信息可能是填写,可能是购买,但是更多时候我们可以根据其再互联网上留下的痕迹抽象出来。
以下是常见的几类行为特征,本文会总结下常见的行为特征怎么用SQL从行为数据中抽取~
1.是否某类用户
2.某时间区内属性特征
3.首末次特征
4.时间距离类和间隔标签
5.比例型标签。
6.区间分布类标签
7.偏好类标签
8.连续型标签
9.一些组合型的标签处理规则
这里先放一个内容大纲。信息较多,所以截图看下。下边会仔细总结每个怎么取获取

假设当前:
用户行为表为events, 包含字段user_id(用户唯一id),event(事件名),time,date
用户表为users. 包含id(用户唯一id),以及常规用户属性信息。
以下从行为表和用户表进行标签创建。生成的标签的格式为:用户id,标签名。
1.是否某类用户
比如,我们经常查看用户是否留存或者是否付费,与其他行为之间的相关性。
1.1是否次留用户
这里示例是否次留用户。
selecta.id as id,if(b.id is not null,1,0) as tag_is_ciliufrom users aleft join(select distinct a.id from users aleft JOIN events bon a.id=b.user_id and datediff(b.time,to_timestamp($first_visit_time/1000))=1where b.user_id is not null ##次留的用户#)bon a.id=b.id
1.2是否首日付费?是否付费?是否3日内付费?
与1.1中类似,区别在于这里获取用户首日和留存数据均来自行为表。
selecta.user_id as id,if(b.user_id is not null,1,0) as tag_firstday_payfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(select distinct a.user_id from(select a.user_id,min(date) as first_day from events a group by a.user_id)aleft JOIN(select distinct a.user_id,date from events where event='PayOrder')bon a.id=b.user_id and datediff(b.date,first_day)=0where b.user_id is not null##这里是首日付费的用户###去掉datediff(b.date,first_day)=0限制,则为有付费的用户###如果需要3日内是否付费,则在将datediff(b.date,first_day)=0改为<=2。####)bon a.user_id=b.user_id
2.某时间区间内属性特征
2.1首日/N日累计付费金额,付费次数,浏览次数等
select a.user_id as id,if(b.pay_money is null,0,b.pay_money) as tag_firstday_pay_money##首日付费次数金额#####改为if(b.pay_cishu is null,0,b.pay_cishu) as tag_firstday_pay_cishu时,则为首日付费次数####from (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(##########这里是取满足条件的用户及其特征值#########select a.user_id,count(order_amount) as pay_cishu,sum(order_amount) as pay_money from(select a.user_id,min(date) as first_day from events a group by a.user_id)a#这里取的是访问首日。如果需要付费首日,则所取first_day,需限定付费事件。left JOIN(seelct a.user_id,date,order_amount from events where event='PayOrder' )bon a.id=b.user_id and datediff(b.date,first_day)=0where b.user_id is not null group by a.user_id########把时间差改为 datediff(b.date,first_day)<=2即为3日内累计充值,注意标签名与此处的条件限制呼应################这里是取满足条件的用户及其特征值#########)bon a.user_id=b.user_id
2.2 近7日付费次数
selecta.user_id as id,if(b.pay_cishu is null,0,b.pay_cishu) as tag_firstday_pay_cishufrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(##########这里是取满足条件的用户及其特征值#########SELECT user_id, COUNT(*) AS pay_cishu FROM eventswhere date BETWEEN current_date() - INTERVAL 7 DAY AND current_date() - INTERVAL 1 DAY AND event='PayOrder'group by 1##########这里是取满足条件的用户及其特征值#########)bon a.user_id=b.user_id
3.首末次特征:首次付费时间/日期/付费金额
3.1末次付费时间,首次付费时间
select a.user_id, pay_lasttime as tag_pay_lasttime,pay_firsttime as tag_pay_firsttimefrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id, MAX(time) AS pay_lasttime,min(time) AS pay_firsttime FROM eventswhere event='PayOrder'group by 1) bon a.user_id=b.user_id
3.2首次标签。首次付费金额/末次付费金额,首次付费渠道/末次付费渠道
select a.user_id, last_order_amount as tag_last_order_amount,first_order_amount as tag_first_order_amountfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(select user_id,first_order_amount,last_order_amountfrom(SELECT user_id, row_number( ) over(partition by user_id order by time asc) as rank,first_value(order_amount) over(partition by user_id order by time asc) as first_order_amount,last_value(order_amount) over(partition by user_id order by time asc) as last_order_amountFROM eventswhere event='PayOrder')a where a.rank=1) bon a.user_id=b.user_id

4.时间距离类标签
4.1 某个时间点距离今天数
select a.user_id, value as tag_lastview_juli_daysfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(#######最近7天最近一次访问距离今天天数#######SELECT user_id, DATEDIFF(now(), time) AS valueFROMSELECT user_id, MAX(time) AS timeFROM eventsWHERE event = 'View'GROUP BY 1) a#######最近7天访问距离今天天数#######) bon a.user_id=b.user_id
4.2间隔时长
复购间隔时长等。见间隔场景
这里以复购间隔为例
select a.user_id, rebuycircle as tag_rebuycirclefrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(#######复购周期天数#######select c.user_id,round(avg(days),0) as rebuycirclefrom(select b.user_id,datediff(b.lastpurtime,b.time) as daysfrom(select tt.*,LEAD(tt.time)over(partition by tt.user_id order by tt.time) as lastpurtimefrom(select t.user_id,t.time,t.order_id,a.pay_freqfrom(select user_id,time,event,order_id from events where event='PayOrder' order by time)t left join(select user_id,event,count(order_id) as pay_freq from events where event='PayOrder' group by user_id,event having count(order_id)>=2) aon t.user_id=a.user_idwhere a.user_id is not null)ttorder by tt.user_id,tt.time) b)cgroup by c.user_id#######复购周期天数#######) bon a.user_id=b.user_id
5.比例型标签
示例为:优惠券敏感型=订单中优惠券使用比例
SELECT a.user_id,value as tag__bilifrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT a.user_id , b.use_cnt a.get_cnt > 0.8 AS valueFROM (SELECT user_id, COUNT(*) AS get_cntFROM eventsWHERE event = 'receiveDiscount'GROUP BY 1) aJOIN ( SELECT user_id, COUNT(*) AS use_cntFROM eventsWHERE event = 'useCoupon'GROUP BY 1) bON a.user_id = b.user_id)bON a.user_id = b.user_id
6.区间分布类
6.1 根据某种行为特征(数值类:发生的频次/天数/总额/或者比值,或者文本类做判断)根据标签值范围,划分区间分布。示例:
SELECT a.user_id, CASE WHEN b.user_id IS NULL THEN '0'WHEN b.order_cnt <= 10 AND b.order_cnt > 0 THEN '(0-10]'WHEN b.order_cnt <= 50 AND b.order_cnt <= 10 THEN '(10-50]'WHEN b.order_cnt > 50 THEN '(50,+∞)' ELSE '0'END AS tag_paycntfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id, COUNT(DISTINCT orderid) AS order_cntFROM eventsWHERE event = 'payOrder'GROUP BY 1) bON a.user_id = b.user_id
6.2 将数值类标签按照数据占据的比例进行区间划分
SELECT a.user_id, CASE WHEN b.user_id IS NULL THEN '0'WHEN (rank() over (order by b.order_cnt desc))/count(b.order_cnt) over ())<0.1 THEN '购买次数前10%'WHEN (rank() over (order by b.order_cnt desc))/count(b.order_cnt) over ())<0.7 THEN '购买次数前10%-70%'else '购买次数最后30%'END AS tag_paycnt_bilifrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id, COUNT(DISTINCT orderid) AS order_cntFROM eventsWHERE event = 'payOrder'GROUP BY 1) bON a.user_id = b.user_id
7.行为偏好类标签
7.1 时间偏好
消费偏好是周末还是工作日
select a.user_id, value as tag_weekday_holidayfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id AS id, distinct_id,CASE WHEN ratio_holiday >= 0.8 THEN '周末偏好'WHEN ratio_weekday >= 0.8 THEN '工作日偏好'ELSE '无明显偏好'END AS valueFROM (SELECT user_id,SUM(CASE WHEN a.date_type = '周末' THEN order_cnt ELSE 0 END) SUM(order_cnt) AS ratio_holiday, SUM(CASE WHEN a.date_type = '工作日' THEN order_cnt ELSE 0 END) SUM(order_cnt) AS ratio_weekday ----计算在周末/工作日消费的次数占总消费次数的比例FROM (SELECT user_id, CASE WHEN pmod(datediff(date, '1900-01-07'), 7) IN (6, 0) THEN '周末' ELSE '工作日'END AS date_type,COUNT(DISTINCT orderid) AS order_cnt ----计算消费当天所属‘周末’还是‘工作日’,以及每个用户的消费次数FROM events WHERE date<current_date()GROUP BY 1,2) aGROUP BY 1) b)bON a.user_id = b.user_id
消费的高峰时段或者时段偏好--同样可进一步判断时段是上午,下午,晚上,凌晨
select a.user_id, value as tag_View_Periodfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id AS id, View_Period as value FROM(SELECT user_id, View_Period , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS row_num FROM (SELECT user_id, hour(time) as View_Period, COUNT(*) AS cnt FROM events ----返回提交订单的小时部分WHERE date BETWEEN current_date() - INTERVAL '29' DAY AND current_date() AND event = 'payOrder'GROUP BY 1,2) a) bWHERE row_num <= 1)bON a.user_id = b.user_id
7.2行为偏好型标签
「最近30天最常消费的商品二级品类」-偏好为单个值
select a.user_id, value as tag_secondCommodityfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id AS id,secondCommodity AS valueFROM (SELECT user_id, secondCommodity, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS row_num ----针对每个用户对商品二级品类的消费次数进行排序FROM (SELECT user_id, secondCommodity, MAX(distinct_id) AS distinct_id, COUNT(*) AS cnt ---- 计算每个用户对每种商品二级品类的消费量FROM eventsWHERE date BETWEEN current_date() - INTERVAL '29' DAY AND current_date()AND event = 'payOrderDetail'GROUP BY 1, 2 ---- 按照SELECT后面第一个和第二个字段进行分组) a) bWHERE row_num <= 1)bON a.user_id = b.user_id
创建「 Top N 的行为特征」的集合类标签
举例:创建「最近180天内用户消费最多的前三名商品二级品类」
select a.user_id, value as tag_secondCommodity_listfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECT user_id AS id, GROUP_CONCAT(secondCommodity) AS value ----将多个字符串进行合并FROM (SELECT user_id, secondCommodity, RANK() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS rank_numFROM (SELECT user_id, secondCommodity, COUNT(*) AS cntFROM eventsWHERE event = 'payOrderDetail'GROUP BY 1, 2) a) bWHERE rank_num <= 3GROUP BY 1/* group_concat 函数可返回每个分组的非NULL值字符串以特定分割符连接的数据结果,默认以逗号分隔 */)bON a.user_id = b.user_id

8.连续N天标签和连续次数标签。
连续N天标签
select a.user_id, if(continuous_days is null,0,continuous_days) as tag_continuous_daysfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECTuser_id, continuous_group, COUNT(1) AS continuous_daysFROM(##原理:ROW_NUMBER 和日期天一样是步长为 1 的数列######如果用户是连续日期做这件事,日期减去ROW_NUMBER一定是相等的,以此构建分组###SELECTymd - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ymd ASC) AS continuous_group,user_id,eventFROM(-- 把用户做这个事件的日期取出来 SELECT DISTINCT DATEDIFF(date,'1970-01-01') AS ymd,user_id,event FROM events WHERE event='$AppStart') event_log) event_groupGROUP BY user_id,continuous_group)bON a.user_id = b.user_id
连续次数标签----参考间隔场景获取间隔次数。
9.一些组合型的标签处理规则
9.1 归一化处理的标签
与9.2联合以以RFM标签示例
9.2加权计算类标签,以RFM标签示例
当然RFM的计算还有其他方法。此处仅为演示归一化及加权计算。
select a.user_id,if(value is null,'',value) as tag_rfmfrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECTz.r_user_id,r*0.2+f*0.3+m*0.5 AS valueFROM(--RFM归一化结果SELECTr_user_id,(r_value-MIN(r_value) OVER ())/(MAX(r_value) OVER () - MIN(r_value) OVER ()) AS r,(f_value-MIN(f_value) OVER ())/(MAX(f_value) OVER () - MIN(f_value) OVER ()) AS f,(m_value-MIN(m_value) OVER ())/(MAX(m_value) OVER () - MIN(m_value) OVER ()) AS mFROM(--R表SELECTuser_id AS r_user_id,value AS r_valueFROMuser_tag_tablename_RWHERETO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)) tag_rLEFT JOIN(--F表SELECTuser_id AS f_user_id,value AS f_valueFROMuser_tag_tablename_FWHERETO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)) tag_fON tag_r.r_user_id=tag_f.f_user_idLEFT JOIN(--M表SELECTuser_id AS m_user_id,value AS m_valueFROMuser_tag_tablename_MWHERETO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)) tag_mON tag_r.r_user_id=tag_m.m_user_idWHEREtag_r.r_value IS NOT NULLAND tag_f.f_value IS NOT NULLAND tag_m.m_value IS NOT NULL) z)rfmON a.user_id = rfm.user_id
9.3特征变化类标签
select a.user_id,if(value is null,'不变',value) as tag_bianhuafrom (select a.user_id,min(date) as first_day from events a group by a.user_id) aleft join(SELECTtag_t0.user_id,CONCAT(tag_t1.value,'->',tag_t0.value) AS valueFROM(-- 获取今天的生命周期状态或者其他特征,这里假设创建好了每天的生命周期状态标签表##SELECT user_id, valueFROM user_tag_user_tag_public_lifecycle_statusWHERE date=current_date()) tag_t0LEFT JOIN(-- 昨天的生命周期状态或者其他特征SELECT user_id, valueFROM user_tag_user_tag_public_lifecycle_statusWHERE date=current_date() - INTERVAL '1' DAY) tag_t1ON tag_t0.user_id=tag_t1.user_idWHEREtag_t0.value<>tag_t1.valueAND tag_t1.value IS NOT NULL)bON a.user_id = b.user_id
9.4更加复杂的模型标签
像活跃度,健康度,忠诚度这种更多更加复杂的标签和特征无非是更复杂的运算或者标签之上的叠加计算。就不再展开讨论了。
能够看到最后的都是最棒的。文章本身不是为了让每个人都一点点耐心的看下去。只是作为一个沉淀,在需要的时候,可以拿出来参考。
本篇内容从有想法到写完发表出来,18点-0点20。因为18点前把公众号上的3篇文章到知乎上,就只是试试各个地方的内容创作~~~因为文字有点多,最近也没啥图,选的有点难~

感兴趣的私聊~啊~~累了,困了~~安~




