点击关注公众号,干货第一时间送达

这是一道计算 "中位数" 的题目,
前面应该没有出现过,
之前有:
SQL脚本案例【10】查询首次下单后第二天连续下单的用户比率
等求解 "平均数"、"比例/比率" 的演示,
此篇,给诸位补充个技能包!
【温馨提示:建表语句及数据导入脚本,已经放在文章末尾】
一、需求场景
题目:
查询商品信息表sku_info,各个品类商品价格的中位数。
备注:
如果某个品类商品的个数为偶数,则中位数是中间两个商品价格的平均值;
如果某个品类商品的个数为奇数,则中位数是中间价格
商品信息表sku_info:

正确结果:

二、解决方案
演示工具:Hive-3.1.3 + DataGrip2022
题目中已经指明,
中位数的求解规则,
所以我们在计算某个品类的中位数时候,
考虑该品类商品个数的奇偶性,
再加以判断即可。
【温馨提示:下面代码所涉及开窗函数 row_number(),已经在前面:
详细介绍过,此处不再赘述】
(1)按照品类category_id分组,价格price排序,计算各个商品排名。按照品类category_id分组,计算各个品类商品个数
select category_id,
sku_id,
price,
count(*) over (partition by category_id) as cn,
row_number() over (partition by category_id order by price) as rk
from sku_info
结果集:

小伙伴们注意到没?
此处的开窗排序用的是:row_number() 函数
可以用 rank() 和 dense_rank() 函数吗?
答案:不行。
举个例子:
商品价格分别是:21、22、22、22、23
row_number() 排序序号是:1、2、3、4、5
商品总个数是5,
则中位数应该是排名为3的价格:22
如果用 dense_rank()排序则序号是:1、2、2、2、3
取得的排名为3的价格就是:23了,显然错误!
如果用 rank() 排序则序号是:1、2、2、2、5
取不到排名为3的价格,显然也错误!
懂了不?
(2)考虑商品个数的奇偶性,取出价格的候选中位数
select category_id,
price
from (
select category_id,
sku_id,
price,
count(*) over (partition by category_id) as cn,
row_number() over (partition by category_id order by price) as rk
from sku_info
)t1
where (cn%2 = 0 and (rk=cn/2 or rk=cn/2+1))
or (cn%2=1 and rk=(cn+1)/2)
结果集:

商品个数cn,
如果是偶数:cn%2=0,
则需要取排序号中间和两个:cn/2、cn/2+1。
如果是奇数:cn%2=1,
则只需要取中间的序号即可:(cn+1)/2/
可以发现,取出来候选值后,
计算候选者的平均值即可。
(3)计算候选中位数的平均值,得出各个品类商品价格的中位数
select category_id,
avg(price) as midPrice
from (
select category_id,
sku_id,
price,
count(*) over (partition by category_id) as cn,
row_number() over (partition by category_id order by price) as rk
from sku_info
)t1
where (cn%2 = 0 and (rk=cn/2 or rk=cn/2+1))
or (cn%2=1 and rk=(cn+1)/2)
group by category_id
结果集:

三、源数据
商品信息表sku_info:
create table sku_info
(
sku_id string,
name string,
category_id string,
from_date string,
price double
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ('field.delim' = '\t') stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://mycluster/user/hive/warehouse/hql_test2.db/sku_info'
tblproperties ('bucketing_version' = '2');
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('1', 'xiaomi 10', '1', '2020-01-01', 2000);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('2', '手机壳', '1', '2020-02-01', 10);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('3', 'apple 12', '1', '2020-03-01', 5000);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('4', 'xiaomi 13', '1', '2020-04-01', 6000);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('5', '破壁机', '2', '2020-01-01', 500);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('6', '洗碗机', '2', '2020-02-01', 2000);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('7', '热水壶', '2', '2020-03-01', 100);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('8', '微波炉', '2', '2020-04-01', 600);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('9', '自行车', '3', '2020-01-01', 1000);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('10', '帐篷', '3', '2020-02-01', 100);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('11', '烧烤架', '3', '2020-02-01', 50);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('12', '遮阳伞', '3', '2020-03-01', 20);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('13', '长粒香', '2', '2020-01-01', 20);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('14', '金龙鱼', '2', '2021-01-01', 20);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('15', '巧乐兹', '2', '2020-01-01', 20);
INSERT INTO hql_test2.sku_info (sku_id, name, category_id, from_date, price) VALUES ('16', '费列罗', '2', '2022-01-01', 20);





