


在营销域的“搜、广、推”业务中,对用户的运营是一个“Interest Targeting”循环迭代的过程:对全量用户画像分析-->挖掘目标用户-->AB测试:触达目标用户-->触达效果画像分析-->挖掘目标用户。这个过程抽象出来的技术需求是:人群圈选&画像分析,技术实现的架构大致如下:

● 成本高:用户标签数据生产好之后需要在ETL数仓和OLAP引擎存两份,这就意味着存储成本,同步作业成本和运维成本都很高,系统整体稳定性也存在潜在风险。
问题分析

圈人SQL实现:
insert overwrite into package_table_1select user_id from tbl_user_tag where tag1='x' and tag2='a'
select tag3, count(distinct user_id) from(select user_id from tbl_user_tag where tag1='x' and tag2='a')group by tag3
1. 技术方面:如果一张表里的字段很多,写时内存hold的block多,内存占用更多,理论上写延迟会更高;标签表一般默认全索引(给所有标签字段建立索引是合理的需求)的话build/load的cpu消耗也会更高。查询时元数据内存占用也会更多。

tbl_user_tag_2

圈人SQL实现:
select user_id from tbl_user_tag_1 where tag1='x'unionselect user_id from tbl_user_tag_2 where tag2='a'或者select user_id from tbl_user_tag_1inner join tbl_user_tag_2on tbl_user_tag_1.user_id = tbl_user_tag_2.user_idwhere tag1='x' and tag2='a'
这里列举了2种写法,只是写法不同,对计算引擎而言,问题都是引入了“大表JOIN”的问题,参与Join的数据规模和用户ID空间成正比,需要进行的Join计算次数和SQL条件复杂程度:需要使用到的标签表的个数成正比(此处只举了2个标签表的情况,实际中可能是十几个甚至几十个标签表进行Join)。“大表JOIN”对计算引擎的CPU资源消耗较高,执行时间也较长,因此传统解决方案中只能放在相对低成本的ETL引擎中处理。

针对上述业务痛点问题,AnalyticDB MySQL引入RoaringBitmap及相关计算函数来将圈人场景中“大表Join”问题转化成“小表Join”问题,并且在oss外表中原生支持Roaringbitmap类型,来解决多份标签数据和人群包导出的问题。
我们准备一份测试数据,来介绍一下各个场景的具体做法,数据处理整体流程大致如下:

步骤一 实时构建bitmap标签表
准备一个测试用的原始标签表users_base,假设所有的用户标签都放在users_base表中,创建测试用的原始标签数据表
create table users_base(uid int,tag1 string, // tag1 取值范围是:x, y, ztag2 string, // tag2 取值范围是:a, btag3 int // tag3 取值范围是:1-10)
// 生成1亿行随机数据submit jobinsert overwrite users_baseselect cast(row_number() over (order by c1) as int) as uid, substring('xyz', floor(rand() * 3) + 1, 1) as tag1, substring('ab', floor(rand() * 2) + 1, 1) as tag2, cast(floor(rand() * 10) + 1 as int) as tag3 from(select A.c1 fromunnest(rb_build_range(0, 10000)) as A(c1)join(select c1 fromunnest(rb_build_range(0, 10000)) as B(c1)))mysql> select * from users_base limit 10;+--------+------+------+------+| uid | tag1 | tag2 | tag3 |+--------+------+------+------+| 74526 | y | b | 3 || 75611 | z | b | 10 || 80850 | x | b | 5 || 81656 | z | b | 7 || 163845 | x | b | 2 || 167007 | y | b | 4 || 170541 | y | b | 9 || 213108 | x | a | 10 || 66056 | y | b | 4 || 67761 | z | a | 2 |+--------+------+------+------+
构建bitmap标签表, 在设计标签表的时候,考虑到利用分布式计算引擎的并发能力,我们引入一个分组字段(本例中使用user_group),将uid分组并行计算,分组的大小可以根据集群CU总数和实际业务情况进行取舍:
1. 一般分组越多计算能力越强,但是分组过多又会导致每个bitmap包含的元素太少,不能充分利用bitmap的计算特性。
// 创建添加分组字段后的用户原始表create table users(uid int,tag1 string,tag2 string,tag3 int,user_group int, 分组字段offset int 偏移)// 将users_base的数据写入userssubmit jobinsert overwrite users select uid, tag1, tag2, tag3, cast(uid%16 as int), cast(floor(uid/16) as int) from users_base// 检查写入后的数据mysql> select * from users limit 10;+---------+------+------+------+------------+--------+| uid | tag1 | tag2 | tag3 | user_group | offset |+---------+------+------+------+------------+--------+| 377194 | z | b | 10 | 10 | 23574 || 309440 | x | a | 1 | 0 | 19340 || 601745 | z | a | 7 | 1 | 37609 || 753751 | z | b | 3 | 7 | 47109 || 988186 | y | a | 10 | 10 | 61761 || 883822 | x | a | 9 | 14 | 55238 || 325065 | x | b | 6 | 9 | 20316 || 1042875 | z | a | 10 | 11 | 65179 || 928606 | y | b | 5 | 14 | 58037 || 990858 | z | a | 8 | 10 | 61928 |+---------+------+------+------+------------+--------+
// 创建tag1的标签表create Table `tag_tbl_1` (`tag1` string,`rb` roaringbitmap,`user_group` int) engine = 'oss'TABLE_PROPERTIES = '{"endpoint":"oss-cn-zhangjiakou.aliyuncs.com","accessid":"******","accesskey":"*************","url":"oss://your oss bucket/roaringbitmap/tag_tbl_1/","format":"parquet"}'// 写入tag1标签表insert overwrite tag_tbl_1 select tag1, rb_build_agg(offset), user_group from users group by tag1, user_group// 检查写入tag1标签表的数据, 可以看到和原始标签users表是对应的,例子中使用rb_to_array来展示数据,实际业务中因为rb包含的uid很多,请谨慎使用mysql> select tag1, user_group, rb_cardinality(rb) from tag_tbl_1;+------+------------+--------------------+| tag1 | user_group | rb_cardinality(rb) |+------+------------+--------------------+| z | 7 | 2082608 || x | 10 | 2082953 || y | 7 | 2084730 || x | 14 | 2084856 |.....| z | 15 | 2084535 || z | 5 | 2083204 || x | 11 | 2085239 || z | 1 | 2084879 |+------+------------+--------------------+48 rows in set (0.66 sec)// 创建tag2的标签表create Table `tag_tbl_2` (`tag2` string,`rb` roaringbitmap,`user_group` int)engine = 'oss'TABLE_PROPERTIES = '{"endpoint":"oss-cn-zhangjiakou.aliyuncs.com","accessid":"******","accesskey":"*************","url":"oss://your oss bucket/roaringbitmap/tag_tbl_2/","format":"parquet"}'// 写入tag2标签表insert overwrite tag_tbl_2 select tag2, rb_build_agg(offset), user_group from users group by tag2, user_group// 检查tag2标签表写入的数据select tag2, user_group, rb_cardinality(rb) from tag_tbl_2;mysql> select tag2, user_group, rb_cardinality(rb) from tag_tbl_2;+------+------------+--------------------+| tag2 | user_group | rb_cardinality(rb) |+------+------------+--------------------+| b | 11 | 3121361 || a | 6 | 3124750 || a | 1 | 3125433 |....| b | 2 | 3126523 || b | 12 | 3123452 || a | 4 | 3126111 || a | 13 | 3123316 || a | 2 | 3123477 |+------+------------+--------------------+32 rows in set (0.44 sec)
步骤二 使用bitmap标签表计算分析
过滤筛选分析:分析tag1 in (x, y) 的用户,在tag2维度的柱状图
// 为了方便理解,先求满足条件的用户是哪儿些select tag2, t1.user_group as user_group, rb_cardinality(rb_and(t2.rb, rb1)) as rb from tag_tbl_2 as t2join (select user_group, rb as rb1 from tag_tbl_1 where tag1 in ('x', 'y'))as t1on t1.user_group = t2.user_group// 求满足条件柱状图select tag2, sum(cnt) from (select tag2, t1.user_group as user_group, rb_cardinality(rb_and(t2.rb, rb1)) as cnt from tag_tbl_2 as t2join (select user_group, rb as rb1 from tag_tbl_1 where tag1 in ('x', 'y'))as t1on t1.user_group = t2.user_group) group by tag2+------+----------+| tag2 | sum(cnt) |+------+----------+| a | 33327868 || b | 33335220 |+------+----------+2 rows in set (0.67 sec)
这个场景使用传统解法的等效实现如下(注:这里模拟真实场景中“大表JOIN”使用子查询的方式查了同一张表的数据):
selecttag2,count(distinct A.uid)from(selectuidfromuserswheretag1 in ('x', 'y')) as Ainner join (selectuid,tag2fromusers) as B on A.uid = B.uidgroup bytag2;+------+-----------------------+| tag2 | count(distinct A.uid) |+------+-----------------------+| b | 33335220 || a | 33327868 |+------+-----------------------+2 rows in set (2.79 sec)
bitmap表交并差:求同时满足条件 tag1 = 'x' or tag2 = 'y' 和 tag2 = ‘b’的用户
select user_group , rb_cardinality(rb) from (selectt1.user_group as user_group,rb_and(rb1, rb2) as rbfrom(selectuser_group,rb_or_agg(rb) as rb1fromtag_tbl_1wheretag1 = 'x'or tag1 = 'y'group byuser_group) as t1join (selectuser_group,rb_or_agg(rb) as rb2fromtag_tbl_2wheretag2 = 'b'group byuser_group) as t2 on t1.user_group = t2.user_groupgroup by user_group)+------------+--------------------+| user_group | rb_cardinality(rb) |+------------+--------------------+| 10 | 2083679 || 3 | 2082370 || 9 | 2082847 || 2 | 2086511 |...| 1 | 2082291 || 4 | 2083290 || 14 | 2083581 || 15 | 2084110 |+------------+--------------------+16 rows in set (0.71 sec)
原始标签表和bitmap标签表交叉计算,求同时满足条件 tag1 = 'x' or tag2 = 'y' 和 tag2 = ‘b’的用户,但是tag2 = ‘b‘ 这个条件的数据从users表中读取,进行交叉计算。
select user_group , rb_cardinality(rb) from (selectt1.user_group as user_group,rb_and(rb1, rb2) as rbfrom(selectuser_group,rb_or_agg(rb) as rb1fromtag_tbl_1wheretag1 = 'x'or tag1 = 'y'group byuser_group) as t1join (selectuser_group,rb_build_agg(offset) as rb2fromuserswheretag2 = 'b'group byuser_group) as t2 on t1.user_group = t2.user_groupgroup by user_group)+------------+--------------------+| user_group | rb_cardinality(rb) |+------------+--------------------+| 3 | 2082370 || 1 | 2082291 || 0 | 2082383 || 4 | 2083290 || 11 | 2081662 || 13 | 2085280 |...| 14 | 2083581 || 15 | 2084110 || 9 | 2082847 || 8 | 2084860 || 5 | 2083056 || 7 | 2083275 |+------------+--------------------+16 rows in set (0.95 sec)
// 创建计算结果的标签表create Table `tag_tbl_3` (`rb` roaringbitmap)engine = 'oss'TABLE_PROPERTIES = '{"endpoint":"oss-cn-zhangjiakou.aliyuncs.com","accessid":"******","accesskey":"*************","url":"oss://your oss bucket/roaringbitmap/tag_tbl_3/","format":"parquet"}'// 将计算结果导出到结果标签表insert overwrite tag_tbl_3selectrb_and(rb1, rb2) as rb,t1.user_group as user_groupfrom(selectuser_group,rb_or_agg(rb) as rb1fromtag_tbl_1wheretag1 = 'x'or tag1 = 'y'group byuser_group) as t1join (selectuser_group,rb_or_agg(rb) as rb2fromtag_tbl_2wheretag2 = 'b'group byuser_group) as t2 on t1.user_group = t2.user_group// 执行结束之后文件会以parquet格式存放在 oss://your oss bucket/roaringbitmap/tag_tbl_3
场景5:加速查询
将tag1的标签表 的数据导入内表加速查询效果
// 创建一个内表,需要将rb字段定义为varbinary格式Create Table `tag_tbl_1_cstore` (`tag1` varchar,`rb` varbinary,`user_group` int)// 将tag1标签表的数据从oss导入到tag_tbl_1_cstoreinsert into tag_tbl_1_cstore select tag1, rb_to_varbinary(rb), user_group from tag_tbl_1// 检查导入的数据mysql> select tag1, user_group, rb_cardinality(rb_or_agg(rb_build_varbinary(rb))) from tag_tbl_1_cstore group by tag1, user_group;+------+------------+---------------------------------------------------+| tag1 | user_group | rb_cardinality(rb_or_agg(rb_build_varbinary(rb))) |+------+------------+---------------------------------------------------+| y | 3 | 2082919 || x | 9 | 2083085 || x | 3 | 2082140 || y | 11 | 2082268 || z | 4 | 2082451 |...| z | 2 | 2081560 || y | 6 | 2082194 || z | 7 | 2082608 |+------+------------+---------------------------------------------------+48 rows in set (0.16 sec)

Roaringbitmap的具体使用方式可以参考官方文档:
https://help.aliyun.com/document_detail/600552.html?spm=a2c4g.600977.0.i2
AnalyticDB MySQL升级为湖仓一体架构,支持高吞吐离线处理和高性能在线分析,可无缝替换CDH/TDH/Databricks/Presto/Spark/Hive等,欢迎点击文末「阅读原文」申请免费试用。

点击阅读原文免费试用云原生数据仓库AnalyticDB





