Presto之常用SQL函数（2）

大数据渡劫 2020-01-14

4336

上篇文章见Presto之常用SQL函数（1）

>>>>

聚合函数：min_by、max_by

可代替row_number()、简洁高效、hive无此函数

min_by(a, b) → [same as a]

返回与b的最小值对应的a值

min_by(a, b, n) → array<[same as a]>

返回与b的前n最小值对应的a值的数组

max_by(a, b) → [same as a]

返回与b的最大值对应的a值

# 建表
create table table1 
(name      string comment '用户姓名'
,age       string comment '用户年龄'
,store_name    string comment '门店名称'
,district_name string comment '用户所在区'
,distance      double comment '距离（公里）')
comment '用户近7日访问某页面时距离最近门店的距离及所在区';


# 插入数据
insert into table1 values
 ('茜', '18', '门店A', '上海市闵行区', 1.5)
,('茜', '18', '门店B', '上海市普陀区', 3)
,('茜', '18', '门店C', '上海市闵行区', 4.69)
,('茜', '18', '门店D', '上海市宝山区', 15)
,('茜', '18', '门店E', '上海市青浦区', 4.1);


# 如果觉得建表、插入表过于麻烦，可选择：
select *
from (select '茜' userguid, '18' age, '门店A' store_name, '上海市闵行区' district_name, 1.5 distance
      union all
      select '茜', '18', '门店B', '上海市普陀区', 3
      union all
      select '茜', '18', '门店C', '上海市闵行区', 4.69
      union all
      select '茜', '18', '门店D', '上海市宝山区', 15
      union all
      select '茜', '18', '门店E', '上海市浦东新区', 4.1
      )

# 需求：选择出用户茜距离最近的门店和当时所在区域
select name
       ,age
       ,min_by(store_name, distance)    as store_name
       ,min_by(district_name, distance) as district_name
       ,min(distance)                   as distance 
from table1
group by 1, 2;
# 结果返回：最小distance对应的门店名称和用户所在区，类似于row_number()，按距离升序取rank=1
# name | age | store_name | district_name | distance
# ----- ----- ------------ --------------- ---------
# 茜   | 18  |    门店A    |   上海市闵行区  | 1.5    


select name
       ,age
       ,min_by(store_name, distance, 2)    as store_name
       ,min_by(district_name, distance, 2) as district_name
       ,min_by(distance,distance,2)        as distance 
from table1
group by 1, 2;
# 结果返回：前两个最小distance对应的门店名称和用户所在区，类似于row_number()，按距离升序取rank>=2
# name | age | store_name |      district_name       | distance
# ----- ----- ------------ -------------------------- ---------
# 茜   | 18  |[门店A, 门店B]| [上海市闵行区, 上海市普陀区] | [1.5, 3]

>>>>

行转列：array_agg

array_agg(x) → array<[same as input]>
从输入的元素中创建数组（聚合函数）

# 需求：选择出用户茜所有出现的区
select name, age, array_agg(distinct district_name) as district_name
from table1
group by name, age
# name | age | district_name
# ----- ----- --------------
# 茜   | 18  | [上海市浦东新区, 上海市闵行区, 上海市普陀区, 上海市宝山区]


# 如果不想返回数组类型，可将array_agg(distinct district_name)改为array_join(array_agg(distinct district_name),',')，返回如下
# name | age | district_name
# ----- ----- --------------
# 茜   | 18  | 上海市浦东新区, 上海市闵行区, 上海市普陀区, 上海市宝山区

>>>>

正则表达式

regexp_extract_all(string, pattern, [group]) → array<varchar>
提取字符串string中所有与模式pattern匹配的子串。pattern中如果使用了分组的功能，则可以通过设置group参数，用于说明匹配哪个捕获组

select regexp_extract_all('1a 2b 14m', '\d+')
# [1, 2, 14]
select regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2)
# ['a', 'b', 'm']

regexp_extract(string, pattern, [group]) → varchar
功能和用法与regexp_extract_all类似，只是本函数只提取第一个匹配的结果

SELECT regexp_extract('1a 2b 14m', '\d+')
# 1
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2)
# 'a'

regexp_like(string, pattern) → boolean
判断字符串string中是否包含符合pattern模式的子串，包含返回TRUE，否则返回FALSE。本函数的功能与 SQL 中的LIKE语句功能相似，不同的是LIKE需要匹配整个模式字串，而本函数只需要字串中包含与模式字串相匹配子串即返回TRUE

SELECT regexp_like('1a 2b 14m', '\d+b')
# true

数据库

文章转载自大数据渡劫，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

Presto之常用SQL函数（2）

评论