暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

常见的几类行为标签

填充空白 2023-05-07
316

用户特征也称为用户标签。是描述用户某一方面(某一个维度)表现的抽象描述。

一般常说的,某个用户或者某个产品的人群画像是一线城市,男性,高收入,爱运动。

这里,一线城市是对用户活动IP的抽象,男是性别或者经常购买商品的性别偏好,高收入是对其月收入/年收入的概括,爱运动是兴趣爱好。

可以看到每一个特征都是用户的一个维度上的表现。有些是基本信息,有些是根据其行为抽象出的信息。


无论我们是做数据挖掘,需要大量的行为特征,还是客户服务,需要了解客户的各维度信息,还是人群分析发现更多的可能,种种都离不开用户特征。

而获取这些信息可能是填写,可能是购买,但是更多时候我们可以根据其再互联网上留下的痕迹抽象出来。


以下是常见的几类行为特征,本文会总结下常见的行为特征怎么用SQL从行为数据中抽取~

1.是否某类用户

2.某时间区内属性特征

3.首末次特征

4.时间距离类和间隔标签

5.比例型标签。

6.区间分布类标签

7.偏好类标签

8.连续型标签

9.一些组合型的标签处理规则


这里先放一个内容大纲。信息较多,所以截图看下。下边会仔细总结每个怎么取获取



假设当前:

用户行为表为events, 包含字段user_id(用户唯一id),event(事件名),time,date

       用户表为users.   包含id(用户唯一id),以及常规用户属性信息。

以下从行为表和用户表进行标签创建。生成的标签的格式为:用户id,标签名。


1.是否某类用户

比如,我们经常查看用户是否留存或者是否付费,与其他行为之间的相关性。

1.1是否次留用户

这里示例是否次留用户。

    select
    a.id as id,if(b.id is not null,1,0) as tag_is_ciliu
    from users a




    left join
    (
    select distinct a.id from users a
    left JOIN events b
    on a.id=b.user_id and datediff(b.time,to_timestamp($first_visit_time/1000))=1
    where b.user_id is not null ##次留的用户#
    )b
    on a.id=b.id

    1.2是否首日付费?是否付费?是否3日内付费?

    与1.1中类似,区别在于这里获取用户首日和留存数据均来自行为表。

      select
      a.user_id as id,if(b.user_id is not null,1,0) as tag_firstday_pay
      from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


      left join
      (
      select distinct a.user_id from
      (select a.user_id,min(date) as first_day from events a group by a.user_id)a
      left JOIN
      (select distinct a.user_id,date from events where event='PayOrder')b
      on a.id=b.user_id and datediff(b.date,first_day)=0
      where b.user_id is not null
      ##这里是首日付费的用户#
      ##去掉datediff(b.date,first_day)=0限制,则为有付费的用户#
      ##如果需要3日内是否付费,则在将datediff(b.date,first_day)=0改为<=2。####


      )b
      on a.user_id=b.user_id


      2.某时间区间内属性特征

      2.1首日/N日累计付费金额,付费次数,浏览次数等

        select a.user_id  as id,if(b.pay_money is null,0,b.pay_money) as tag_firstday_pay_money
        ##首日付费次数金额##
        ###改为if(b.pay_cishu is null,0,b.pay_cishu) as tag_firstday_pay_cishu时,则为首日付费次数####


        from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


        left join
        (##########这里是取满足条件的用户及其特征值#########
        select a.user_id,count(order_amount) as pay_cishu,sum(order_amount) as pay_money from
        (select a.user_id,min(date) as first_day from events a group by a.user_id)a
        #这里取的是访问首日。如果需要付费首日,则所取first_day,需限定付费事件。


        left JOIN
        (seelct a.user_id,date,order_amount from events where event='PayOrder' )b
        on a.id=b.user_id and datediff(b.date,first_day)=0
        where b.user_id is not null group by a.user_id
        ########把时间差改为 datediff(b.date,first_day)<=2即为3日内累计充值,注意标签名与此处的条件限制呼应######
        ##########这里是取满足条件的用户及其特征值#########


        )b
        on a.user_id=b.user_id


        2.2 近7日付费次数

          select
          a.user_id as id,if(b.pay_cishu is null,0,b.pay_cishu) as tag_firstday_pay_cishu


          from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


          left join
          (##########这里是取满足条件的用户及其特征值#########
          SELECT user_id, COUNT(*) AS pay_cishu FROM events
          where date BETWEEN current_date() - INTERVAL 7 DAY AND current_date() - INTERVAL 1 DAY AND event='PayOrder'
          group by 1
          ##########这里是取满足条件的用户及其特征值#########


          )b
          on a.user_id=b.user_id


          3.首末次特征:首次付费时间/日期/付费金额

          3.1末次付费时间,首次付费时间

            select a.user_id, pay_lasttime as tag_pay_lasttime,
            pay_firsttime as tag_pay_firsttime
            from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


            left join
            (SELECT user_id, MAX(time) AS pay_lasttime,min(time) AS pay_firsttime FROM events
            where event='PayOrder'
            group by 1) b
            on a.user_id=b.user_id



            3.2首次标签。首次付费金额/末次付费金额,首次付费渠道/末次付费渠道

              select a.user_id, last_order_amount as tag_last_order_amount,
              first_order_amount as tag_first_order_amount
              from (select a.user_id,min(date) as first_day from events a group by a.user_id) a




              left join
              (select user_id,first_order_amount,last_order_amount
              from
              (SELECT user_id, row_number( ) over(partition by user_id order by time asc) as rank,
              first_value(order_amount) over(partition by user_id order by time asc) as first_order_amount,
              last_value(order_amount) over(partition by user_id order by time asc) as last_order_amount
              FROM events
              where event='PayOrder')a where a.rank=1
              ) b
              on a.user_id=b.user_id

              4.时间距离类标签

              4.1 某个时间点距离今天数

                select a.user_id, value as tag_lastview_juli_days
                from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                left join
                (
                #######最近7天最近一次访问距离今天天数#######
                SELECT user_id, DATEDIFF(now(), time) AS value
                FROM
                SELECT user_id, MAX(time) AS time
                FROM events
                WHERE event = 'View'
                GROUP BY 1) a
                #######最近7天访问距离今天天数#######
                ) b
                on a.user_id=b.user_id


                4.2间隔时长

                复购间隔时长等。见间隔场景

                这里以复购间隔为例

                  select a.user_id, rebuycircle as tag_rebuycircle
                  from (select a.user_id,min(date) as first_day from events a group by a.user_id) a




                  left join
                  (
                  #######复购周期天数#######
                  select c.user_id,round(avg(days),0) as rebuycircle
                  from
                  (select b.user_id,datediff(b.lastpurtime,b.time) as days
                  from
                  (select tt.*,LEAD(tt.time)over(partition by tt.user_id order by tt.time) as lastpurtime
                  from
                  (select t.user_id,t.time,t.order_id,a.pay_freq
                  from
                  (select user_id,time,event,order_id from events where event='PayOrder' order by time)t left join
                  (select user_id,event,count(order_id) as pay_freq from events where event='PayOrder' group by user_id,event having count(order_id)>=2) a
                  on t.user_id=a.user_id
                  where a.user_id is not null)tt
                  order by tt.user_id,tt.time) b
                  )c
                  group by c.user_id
                  #######复购周期天数#######


                  ) b
                  on a.user_id=b.user_id


                  5.比例型标签

                  示例为:优惠券敏感型=订单中优惠券使用比例

                    SELECT a.user_id,
                    value as tag__bili
                    from (select a.user_id,min(date) as first_day from events a group by a.user_id) a
                    left join


                    (
                    SELECT a.user_id , b.use_cnt a.get_cnt > 0.8 AS value
                    FROM (
                    SELECT user_id, COUNT(*) AS get_cnt
                    FROM events
                    WHERE event = 'receiveDiscount'
                    GROUP BY 1
                    ) a
                    JOIN ( SELECT user_id, COUNT(*) AS use_cnt
                    FROM events
                    WHERE event = 'useCoupon'
                    GROUP BY 1
                    ) b
                    ON a.user_id = b.user_id


                    )b
                    ON a.user_id = b.user_id


                    6.区间分布类

                    6.1 根据某种行为特征(数值类:发生的频次/天数/总额/或者比值,或者文本类做判断)根据标签值范围,划分区间分布。示例:

                      SELECT a.user_id
                      , CASE WHEN b.user_id IS NULL THEN '0'
                      WHEN b.order_cnt <= 10 AND b.order_cnt > 0 THEN '(0-10]'
                      WHEN b.order_cnt <= 50 AND b.order_cnt <= 10 THEN '(10-50]'
                      WHEN b.order_cnt > 50 THEN '(50,+∞)' ELSE '0'
                      END AS tag_paycnt
                      from (select a.user_id,min(date) as first_day from events a group by a.user_id) a
                      left join


                      (
                      SELECT user_id, COUNT(DISTINCT orderid) AS order_cnt
                      FROM events
                      WHERE event = 'payOrder'
                      GROUP BY 1
                      ) b
                      ON a.user_id = b.user_id


                      6.2 将数值类标签按照数据占据的比例进行区间划分

                        SELECT a.user_id
                        , CASE WHEN b.user_id IS NULL THEN '0'
                        WHEN (rank() over (order by b.order_cnt desc))/count(b.order_cnt) over ())<0.1 THEN '购买次数前10%'
                        WHEN (rank() over (order by b.order_cnt desc))/count(b.order_cnt) over ())<0.7 THEN '购买次数前10%-70%'
                        else '购买次数最后30%'
                        END AS tag_paycnt_bili
                        from (select a.user_id,min(date) as first_day from events a group by a.user_id) a
                        left join


                        (
                        SELECT user_id, COUNT(DISTINCT orderid) AS order_cnt
                        FROM events
                        WHERE event = 'payOrder'
                        GROUP BY 1
                        ) b
                        ON a.user_id = b.user_id


                        7.行为偏好类标签

                        7.1 时间偏好

                        消费偏好是周末还是工作日

                          select a.user_id, value as tag_weekday_holiday
                          from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                          left join
                          (


                          SELECT user_id AS id, distinct_id,
                          CASE WHEN ratio_holiday >= 0.8 THEN '周末偏好'
                          WHEN ratio_weekday >= 0.8 THEN '工作日偏好'
                          ELSE '无明显偏好'
                          END AS value
                          FROM (
                          SELECT user_id,SUM(CASE WHEN a.date_type = '周末' THEN order_cnt ELSE 0 END) SUM(order_cnt) AS ratio_holiday
                          , SUM(CASE WHEN a.date_type = '工作日' THEN order_cnt ELSE 0 END) SUM(order_cnt) AS ratio_weekday ----计算在周末/工作日消费的次数占总消费次数的比例
                          FROM (
                          SELECT user_id, CASE WHEN pmod(datediff(date, '1900-01-07'), 7) IN (6, 0) THEN '周末' ELSE '工作日'END AS date_type,
                          COUNT(DISTINCT orderid) AS order_cnt ----计算消费当天所属‘周末’还是‘工作日’,以及每个用户的消费次数
                          FROM events WHERE date<current_date()
                          GROUP BY 1,2
                          ) a
                          GROUP BY 1
                          ) b
                          )b
                          ON a.user_id = b.user_id


                          消费的高峰时段或者时段偏好--同样可进一步判断时段是上午,下午,晚上,凌晨

                            select a.user_id, value as tag_View_Period
                            from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                            left join
                            (
                            SELECT user_id AS id, View_Period as value FROM
                            (
                            SELECT user_id, View_Period , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS row_num FROM (
                            SELECT user_id, hour(time) as View_Period, COUNT(*) AS cnt FROM events ----返回提交订单的小时部分
                            WHERE date BETWEEN current_date() - INTERVAL '29' DAY AND current_date() AND event = 'payOrder'
                            GROUP BY 1,2) a
                            ) b
                            WHERE row_num <= 1
                            )b
                            ON a.user_id = b.user_id


                            7.2行为偏好型标签

                            「最近30天最常消费的商品二级品类」-偏好为单个值

                              select a.user_id, value as tag_secondCommodity
                              from (select a.user_id,min(date) as first_day from events a group by a.user_id) a




                              left join
                              (


                              SELECT user_id AS id,secondCommodity AS value
                              FROM (
                              SELECT user_id, secondCommodity, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS row_num ----针对每个用户对商品二级品类的消费次数进行排序
                              FROM (
                              SELECT user_id, secondCommodity, MAX(distinct_id) AS distinct_id
                              , COUNT(*) AS cnt ---- 计算每个用户对每种商品二级品类的消费量
                              FROM events
                              WHERE date BETWEEN current_date() - INTERVAL '29' DAY AND current_date()
                              AND event = 'payOrderDetail'
                              GROUP BY 1, 2 ---- 按照SELECT后面第一个和第二个字段进行分组
                              ) a
                              ) b
                              WHERE row_num <= 1
                              )b
                              ON a.user_id = b.user_id


                              创建「 Top N 的行为特征」的集合类标签

                              举例:创建「最近180天内用户消费最多的前三名商品二级品类」

                                select a.user_id, value as tag_secondCommodity_list
                                from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                                left join
                                (
                                SELECT user_id AS id, GROUP_CONCAT(secondCommodity) AS value ----将多个字符串进行合并
                                FROM (
                                SELECT user_id, secondCommodity, RANK() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS rank_num
                                FROM (
                                SELECT user_id, secondCommodity
                                , COUNT(*) AS cnt
                                FROM events
                                WHERE event = 'payOrderDetail'
                                GROUP BY 1, 2
                                ) a
                                ) b
                                WHERE rank_num <= 3
                                GROUP BY 1
                                /* group_concat 函数可返回每个分组的非NULL值字符串以特定分割符连接的数据结果,默认以逗号分隔 */


                                )b
                                ON a.user_id = b.user_id

                                8.连续N天标签和连续次数标签。

                                连续N天标签

                                  select a.user_id, if(continuous_days is null,0,continuous_days) as tag_continuous_days
                                  from (select a.user_id,min(date) as first_day from events a group by a.user_id) a




                                  left join
                                  (
                                  SELECT
                                  user_id, continuous_group, COUNT(1) AS continuous_days
                                  FROM
                                  (##原理:ROW_NUMBER 和日期天一样是步长为 1 的数列###
                                  ###如果用户是连续日期做这件事,日期减去ROW_NUMBER一定是相等的,以此构建分组###
                                  SELECT
                                  ymd - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ymd ASC) AS continuous_group,user_id,event
                                  FROM
                                  (-- 把用户做这个事件的日期取出来 SELECT DISTINCT DATEDIFF(date,'1970-01-01') AS ymd,user_id,event FROM events WHERE event='$AppStart') event_log
                                  ) event_group
                                  GROUP BY user_id,continuous_group




                                  )b
                                  ON a.user_id = b.user_id


                                  连续次数标签----参考间隔场景获取间隔次数。


                                  9.一些组合型的标签处理规则

                                  9.1 归一化处理的标签

                                  与9.2联合以以RFM标签示例


                                  9.2加权计算类标签,以RFM标签示例

                                  当然RFM的计算还有其他方法。此处仅为演示归一化及加权计算。

                                    select a.user_id,if(value is null,'',value) as tag_rfm
                                    from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                                    left join


                                    (SELECT
                                        z.r_user_id,
                                    r*0.2+f*0.3+m*0.5 AS value
                                    FROM
                                    (--RFM归一化结果
                                    SELECT
                                    r_user_id,
                                    (r_value-MIN(r_value) OVER ())/(MAX(r_value) OVER () - MIN(r_value) OVER ()) AS r,
                                    (f_value-MIN(f_value) OVER ())/(MAX(f_value) OVER () - MIN(f_value) OVER ()) AS f,
                                    (m_value-MIN(m_value) OVER ())/(MAX(m_value) OVER () - MIN(m_value) OVER ()) AS m
                                    FROM
                                    (--R表
                                    SELECT
                                    user_id AS r_user_id,
                                    value AS r_value
                                    FROM
                                    user_tag_tablename_R
                                    WHERE
                                    TO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)
                                    ) tag_r
                                    LEFT JOIN
                                    (--F表
                                    SELECT
                                    user_id AS f_user_id,
                                    value AS f_value
                                    FROM
                                    user_tag_tablename_F
                                    WHERE
                                    TO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)
                                    ) tag_f
                                    ON tag_r.r_user_id=tag_f.f_user_id
                                    LEFT JOIN
                                    (--M表
                                    SELECT
                                    user_id AS m_user_id,
                                    value AS m_value
                                    FROM
                                    user_tag_tablename_M
                                    WHERE
                                    TO_DATE(FROM_UNIXTIME(CAST(base_time/1000 AS BIGINT),'yyyy-MM-dd'))=CAST('2020-11-05' AS TIMESTAMP)
                                    ) tag_m
                                    ON tag_r.r_user_id=tag_m.m_user_id
                                    WHERE
                                    tag_r.r_value IS NOT NULL
                                    AND tag_f.f_value IS NOT NULL
                                    AND tag_m.m_value IS NOT NULL
                                    ) z
                                    )rfm


                                    ON a.user_id = rfm.user_id


                                    9.3特征变化类标签

                                      select a.user_id,if(value is null,'不变',value) as tag_bianhua
                                      from (select a.user_id,min(date) as first_day from events a group by a.user_id) a


                                      left join


                                      (
                                      SELECT
                                          tag_t0.user_id,
                                      CONCAT(tag_t1.value,'->',tag_t0.value) AS value
                                      FROM
                                      (-- 获取今天的生命周期状态或者其他特征,这里假设创建好了每天的生命周期状态标签表##
                                      SELECT user_id, value
                                      FROM user_tag_user_tag_public_lifecycle_status
                                      WHERE date=current_date()
                                      ) tag_t0
                                      LEFT JOIN
                                      (-- 昨天的生命周期状态或者其他特征
                                      SELECT user_id, value
                                      FROM user_tag_user_tag_public_lifecycle_status
                                      WHERE date=current_date() - INTERVAL '1' DAY
                                      ) tag_t1
                                      ON tag_t0.user_id=tag_t1.user_id
                                      WHERE
                                      tag_t0.value<>tag_t1.value
                                      AND tag_t1.value IS NOT NULL
                                      )b


                                      ON a.user_id = b.user_id


                                      9.4更加复杂的模型标签

                                      像活跃度,健康度,忠诚度这种更多更加复杂的标签和特征无非是更复杂的运算或者标签之上的叠加计算。就不再展开讨论了。



                                      能够看到最后的都是最棒的。文章本身不是为了让每个人都一点点耐心的看下去。只是作为一个沉淀,在需要的时候,可以拿出来参考。


                                      本篇内容从有想法到写完发表出来,18点-0点20。因为18点前把公众号上的3篇文章到知乎上,就只是试试各个地方的内容创作~~~因为文字有点多,最近也没啥图,选的有点难~

                                      感兴趣的私聊~啊~~累了,困了~~安~

                                      文章转载自填充空白,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                      评论