MySQL8.0窗口函数实验
窗口函数是一种强大的数据分析工具,允许在数据集的特定"窗口"(数据子集)内执行计算,同时保留原始行数据。以下是其核心使用方法和场景总结:
一、窗口函数分类及常用函数
序号函数:
ROW_NUMBER():生成连续唯一序号(无并列),如1,2,3。
SELECT name, score, ROW_NUMBER() OVER (ORDER BY score DESC) AS rank FROM students;
RANK():允许并列但序号跳跃(如1,1,3)。
DENSE_RANK():允许并列且序号连续(如1,1,2)。
分布函数:
PERCENT_RANK():计算百分比排名(范围[0,1])。
CUME_DIST():计算累积分布值(如小于等于当前值的行占比)。
前后函数:
LAG(expr, N):获取当前行前第 N 行的值。
LEAD(expr, N):获取当前行后第 N 行的值。
SELECT order_date, LAG(amount, 1) OVER (ORDER BY date) AS prev_amount FROM orders;
二、基本语法结构
函数名([参数]) OVER (
[PARTITION BY 分组字段]
[ORDER BY 排序字段]
[ROWS/RANGE 窗口范围]
)
PARTITION BY:按字段分组,窗口函数在组内独立计算(类似GROUP BY,但保留所有行)。
ORDER BY:定义窗口内的排序规则。
窗口范围:
ROWS:基于物理行数(如ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING包含前一行、当前行、后一行)。
RANGE:基于逻辑值范围(如按时间差值动态扩展窗口)。
滑动窗口的范围指定方式
基于行的范围指定
通常使用 BETWEEN frame_start AND frame_end 语法表示行范围,frame_start 和 frame_end 支持以下关键字定义动态行记录边界:
CURRENT_ROW
边界是当前行,一般与其他范围关键字配合使用。
UNBOUNDED_PRECEDING
边界是分区中的第一行。
UNBOUNDED_FOLLOWING
边界是分区中的最后一行。
expr_PRECEDING
边界是当前行减去 expr 的值(expr 为数值表达式)。
expr_FOLLOWING
边界是当前行加上 expr 的值(expr 为数值表达式)。
举例:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
窗口范围包含当前行、前一行、后一行,共 3 行记录。ROWS UNBOUNDED FOLLOWING
窗口范围从当前行到分区中的最后一行。ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
窗口范围为当前分区所有行,等同于省略范围指定。
三、实验部分
环境初始化参考文章【Mysql8.0窗口函数用法】
1.UNBOUNDED PRECEDING表示边界永远为第一行:基于第一行进行累加
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS UNBOUNDED PRECEDING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:28: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
-> DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS UNBOUNDED PRECEDING) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+---------+-----------+--------------+
| emp_no | dept_no | salary | row_num | dept_rank | salary_total |
+--------+---------+--------+---------+-----------+--------------+
| 10001 | d005 | 60117 | 10 | 10 | 60117 |
| 10001 | d005 | 62102 | 9 | 9 | 122219 |
| 10001 | d005 | 66074 | 8 | 8 | 188293 |
| 10001 | d005 | 66596 | 7 | 7 | 254889 |
| 10001 | d005 | 66961 | 6 | 6 | 321850 |
| 10001 | d005 | 71046 | 5 | 5 | 392896 |
| 10001 | d005 | 74333 | 4 | 4 | 467229 |
| 10001 | d005 | 75286 | 3 | 3 | 542515 |
| 10001 | d005 | 75994 | 2 | 2 | 618509 |
| 10001 | d005 | 76884 | 1 | 1 | 695393 |
+--------+---------+--------+---------+-----------+--------------+
10 rows in set (0.00 sec)
2.CURRENT ROW表示获取当前行记录,也就是边界是当前行,等值关系
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary range current row) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:28: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
-> DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary range current row) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+---------+-----------+--------------+
| emp_no | dept_no | salary | row_num | dept_rank | salary_total |
+--------+---------+--------+---------+-----------+--------------+
| 10001 | d005 | 60117 | 10 | 10 | 60117 |
| 10001 | d005 | 62102 | 9 | 9 | 62102 |
| 10001 | d005 | 66074 | 8 | 8 | 66074 |
| 10001 | d005 | 66596 | 7 | 7 | 66596 |
| 10001 | d005 | 66961 | 6 | 6 | 66961 |
| 10001 | d005 | 71046 | 5 | 5 | 71046 |
| 10001 | d005 | 74333 | 4 | 4 | 74333 |
| 10001 | d005 | 75286 | 3 | 3 | 75286 |
| 10001 | d005 | 75994 | 2 | 2 | 75994 |
| 10001 | d005 | 76884 | 1 | 1 | 76884 |
+--------+---------+--------+---------+-----------+--------------+
10 rows in set (0.01 sec)
3.UNBOUNDED FOLLOWING 表示边界永远为最后一行
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS BETWEEN UNBOUNDED preceding and UNBOUNDED FOLLOWING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:39: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> ROW_NUMBER() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS row_num,
-> DENSE_RANK() OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary DESC) AS dept_rank,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS BETWEEN UNBOUNDED preceding and UNBOUNDED FOLLOWING) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+---------+-----------+--------------+
| emp_no | dept_no | salary | row_num | dept_rank | salary_total |
+--------+---------+--------+---------+-----------+--------------+
| 10001 | d005 | 60117 | 10 | 10 | 695393 |
| 10001 | d005 | 62102 | 9 | 9 | 695393 |
| 10001 | d005 | 66074 | 8 | 8 | 695393 |
| 10001 | d005 | 66596 | 7 | 7 | 695393 |
| 10001 | d005 | 66961 | 6 | 6 | 695393 |
| 10001 | d005 | 71046 | 5 | 5 | 695393 |
| 10001 | d005 | 74333 | 4 | 4 | 695393 |
| 10001 | d005 | 75286 | 3 | 3 | 695393 |
| 10001 | d005 | 75994 | 2 | 2 | 695393 |
| 10001 | d005 | 76884 | 1 | 1 | 695393 |
+--------+---------+--------+---------+-----------+--------------+
10 rows in set (0.00 sec)
4.EXPR PRECEDING/FOLLOWING 表示带表达式的边界。
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS 1 PRECEDING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 01:55: [employees]> select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10
-> ;
+--------+---------+--------+
| emp_no | dept_no | salary |
+--------+---------+--------+
| 10001 | d005 | 60117 |
| 10001 | d005 | 62102 |
| 10001 | d005 | 66074 |
| 10001 | d005 | 66596 |
| 10001 | d005 | 66961 |
| 10001 | d005 | 71046 |
| 10001 | d005 | 74333 |
| 10001 | d005 | 75286 |
| 10001 | d005 | 75994 |
| 10001 | d005 | 76884 |
+--------+---------+--------+
10 rows in set (0.01 sec)
root@db 01:55: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS 1 PRECEDING) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+--------------+
| emp_no | dept_no | salary | salary_total |
+--------+---------+--------+--------------+
| 10001 | d005 | 60117 | 60117 |
| 10001 | d005 | 62102 | 122219 |
| 10001 | d005 | 66074 | 128176 |
| 10001 | d005 | 66596 | 132670 |
| 10001 | d005 | 66961 | 133557 |
| 10001 | d005 | 71046 | 138007 |
| 10001 | d005 | 74333 | 145379 |
| 10001 | d005 | 75286 | 149619 |
| 10001 | d005 | 75994 | 151280 |
| 10001 | d005 | 76884 | 152878 |
+--------+---------+--------+--------------+
10 rows in set (0.36 sec)
上面的查询表达的意思是基于分组内每行记录和它上一条记录求和,不累加。
//求每行和它上面两行的和:
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS 2 PRECEDING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 01:57: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS 2 PRECEDING) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+--------------+
| emp_no | dept_no | salary | salary_total |
+--------+---------+--------+--------------+
| 10001 | d005 | 60117 | 60117 |
| 10001 | d005 | 62102 | 122219 |
| 10001 | d005 | 66074 | 188293 |
| 10001 | d005 | 66596 | 194772 |
| 10001 | d005 | 66961 | 199631 |
| 10001 | d005 | 71046 | 204603 |
| 10001 | d005 | 74333 | 212340 |
| 10001 | d005 | 75286 | 220665 |
| 10001 | d005 | 75994 | 225613 |
| 10001 | d005 | 76884 | 228164 |
+--------+---------+--------+--------------+
10 rows in set (0.00 sec)
//求每行的前两行和后面四行相加的结果
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 2 PRECEDING and 4 FOLLOWING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:00: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 2 PRECEDING and 4 FOLLOWING) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+--------------+
| emp_no | dept_no | salary | salary_total |
+--------+---------+--------+--------------+
| 10001 | d005 | 60117 | 321850 |
| 10001 | d005 | 62102 | 392896 |
| 10001 | d005 | 66074 | 467229 |
| 10001 | d005 | 66596 | 482398 |
| 10001 | d005 | 66961 | 496290 |
| 10001 | d005 | 71046 | 507100 |
| 10001 | d005 | 74333 | 440504 |
| 10001 | d005 | 75286 | 373543 |
| 10001 | d005 | 75994 | 302497 |
| 10001 | d005 | 76884 | 228164 |
+--------+---------+--------+--------------+
10 rows in set (0.00 sec)
5.RANGE PRECEDING/FOLLOWING
表示当前行值范围内的分组记录。ROWS对应的是行号,RANGE对应的行值。
计算逻辑是:计算当前行salary减去1的值,如果计算的值在结果集中不存在,此时salary_total的值为salary当前值,如果计算的值在结果集中存在,此时salary_total的值为salary当前值加上【salary的值减去1】。
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary RANGE 1 PRECEDING) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:13: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary, SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary RANGE 1 PRECEDING) AS salary_total FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+--------------+
| emp_no | dept_no | salary | salary_total |
+--------+---------+--------+--------------+
| 10001 | d005 | 60117 | 60117 |
| 10001 | d005 | 62102 | 62102 |
| 10001 | d005 | 66074 | 66074 |
| 10001 | d005 | 66596 | 66596 |
| 10001 | d005 | 66961 | 66961 |
| 10001 | d005 | 71046 | 71046 |
| 10001 | d005 | 74333 | 74333 |
| 10001 | d005 | 75286 | 75286 |
| 10001 | d005 | 75994 | 75994 |
| 10001 | d005 | 76884 | 76884 |
+--------+---------+--------+--------------+
10 rows in set (0.00 sec)
//RANGE between 1 PRECEDING and 1 FOLLOWING:对每行值-1和+1后对应的值进行求和
计算逻辑:计算当前行值-1和+1后对应的值,如果在结果集中都不存在,则salary_total的值取salary的当前值;
如果在结果集中存在,则salary_total的值取salary的当前值+对应的值。
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary RANGE between 1 PRECEDING and 1 FOLLOWING ) AS salary_total
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
root@db 02:13: [employees]> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary RANGE between 1 PRECEDING and 1 FOLLOWING ) AS salary_total
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
+--------+---------+--------+--------------+
| emp_no | dept_no | salary | salary_total |
+--------+---------+--------+--------------+
| 10001 | d005 | 60117 | 60117 |
| 10001 | d005 | 62102 | 62102 |
| 10001 | d005 | 66074 | 66074 |
| 10001 | d005 | 66596 | 66596 |
| 10001 | d005 | 66961 | 66961 |
| 10001 | d005 | 71046 | 71046 |
| 10001 | d005 | 74333 | 74333 |
| 10001 | d005 | 75286 | 75286 |
| 10001 | d005 | 75994 | 75994 |
| 10001 | d005 | 76884 | 76884 |
+--------+---------+--------+--------------+
10 rows in set (0.00 sec)
6.row_number()
##查询员工部门表dept_emp中,最晚进入公司的人员名单
select * from (
select row_number() over (partition by dept_no order by from_date asc) as row_num,
emp_no,
dept_no,
from_date,
to_date
from dept_emp) t where row_num=1;
或者写成如下
select * from (
select row_number() over w as row_num,
emp_no,
dept_no,
from_date,
to_date
from dept_emp
WINDOW w AS (partition by dept_no order by from_date asc)
) t where row_num=1;
mysql> select * from (
-> select row_number() over (partition by dept_no order by from_date desc) as row_num,
-> emp_no,
-> dept_no,
-> from_date,
-> to_date
-> from dept_emp) t where row_num=1;
+---------+--------+---------+------------+------------+
| row_num | emp_no | dept_no | from_date | to_date |
+---------+--------+---------+------------+------------+
| 1 | 483985 | d001 | 2002-07-30 | 9999-01-01 |
| 1 | 445457 | d002 | 2002-07-29 | 9999-01-01 |
| 1 | 109702 | d003 | 2002-07-28 | 9999-01-01 |
| 1 | 413250 | d004 | 2002-08-01 | 9999-01-01 |
| 1 | 83118 | d005 | 2002-08-01 | 9999-01-01 |
| 1 | 227384 | d006 | 2002-08-01 | 9999-01-01 |
| 1 | 71987 | d007 | 2002-07-31 | 9999-01-01 |
| 1 | 282455 | d008 | 2002-08-01 | 9999-01-01 |
| 1 | 408784 | d009 | 2002-08-01 | 9999-01-01 |
+---------+--------+---------+------------+------------+
9 rows in set (0.52 sec)
##查询员工部门表dept_emp中,最早进入公司的人员名单
mysql> select * from (
-> select row_number() over (partition by dept_no order by from_date asc) as row_num,
-> emp_no,
-> dept_no,
-> from_date,
-> to_date
-> from dept_emp) t where row_num=1;
+---------+--------+---------+------------+------------+
| row_num | emp_no | dept_no | from_date | to_date |
+---------+--------+---------+------------+------------+
| 1 | 110022 | d001 | 1985-01-01 | 9999-01-01 |
| 1 | 110085 | d002 | 1985-01-01 | 9999-01-01 |
| 1 | 110183 | d003 | 1985-01-01 | 9999-01-01 |
| 1 | 110303 | d004 | 1985-01-01 | 9999-01-01 |
| 1 | 110511 | d005 | 1985-01-01 | 9999-01-01 |
| 1 | 110725 | d006 | 1985-01-01 | 9999-01-01 |
| 1 | 111035 | d007 | 1985-01-01 | 9999-01-01 |
| 1 | 111400 | d008 | 1985-01-01 | 9999-01-01 |
| 1 | 111692 | d009 | 1985-01-01 | 9999-01-01 |
+---------+--------+---------+------------+------------+
9 rows in set (0.42 sec)
7.如果窗口比较多,可以给窗口设置别名
select * from (
select row_number() over w as row_num,
emp_no,
dept_no,
from_date,
to_date
from dept_emp
WINDOW w AS (partition by dept_no order by from_date DESC)
) t where row_num=1;
select * from (
select row_number() over w as row_num,
emp_no,
dept_no,
from_date,
to_date
from dept_emp
WINDOW w AS (partition by dept_no order by from_date DESC ROWS between 1 PRECEDING and 1 FOLLOWING)
) t;
8.ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
窗口范围包含当前行、前一行、后一行,共 3 行记录
写法一:
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 1 PRECEDING and 1 FOLLOWING ) AS salary_total,
avg(tmp.salary) OVER (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 1 PRECEDING and 1 FOLLOWING ) AS salary_avg
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp;
写法二:
SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
SUM(tmp.salary) OVER s AS salary_total,
AVG(tmp.salary) OVER s AS salary_avg
FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp WINDOW s AS (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 1 PRECEDING and 1 FOLLOWING );
mysql> SELECT tmp.emp_no,tmp.dept_no, tmp.salary,
-> SUM(tmp.salary) OVER s AS salary_total,
-> AVG(tmp.salary) OVER s AS salary_avg
-> FROM (select de.emp_no,de.dept_no,sa.salary from dept_emp de, salaries sa where de.emp_no=sa.emp_no and dept_no='d005' limit 10) tmp WINDOW s AS (PARTITION BY tmp.dept_no ORDER BY tmp.salary ROWS between 1 PRECEDING and 1 FOLLOWING );
+--------+---------+--------+--------------+------------+
| emp_no | dept_no | salary | salary_total | salary_avg |
+--------+---------+--------+--------------+------------+
| 10001 | d005 | 60117 | 122219 | 61109.5000 |
| 10001 | d005 | 62102 | 188293 | 62764.3333 |
| 10001 | d005 | 66074 | 194772 | 64924.0000 |
| 10001 | d005 | 66596 | 199631 | 66543.6667 |
| 10001 | d005 | 66961 | 204603 | 68201.0000 |
| 10001 | d005 | 71046 | 212340 | 70780.0000 |
| 10001 | d005 | 74333 | 220665 | 73555.0000 |
| 10001 | d005 | 75286 | 225613 | 75204.3333 |
| 10001 | d005 | 75994 | 228164 | 76054.6667 |
| 10001 | d005 | 76884 | 152878 | 76439.0000 |
+--------+---------+--------+--------------+------------+
10 rows in set (0.01 sec)
9.LAG(expr,N)/LEAD(expr,N)
LAG(expr,N)和LEAD(expr,N)这两个函数的功能是获取当前数据行按照某种排序规则的上N行(LAG)/下N行(LEAD)数据的某个字段。比如,统计工资表中两条记录的时间间隔,那么就可以用LAG函数来实现,SQL代码如下:
mysql> desc salaries;
+-----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| emp_no | int(11) | NO | PRI | NULL | |
| salary | int(11) | NO | | NULL | |
| from_date | date | NO | PRI | NULL | |
| to_date | date | NO | | NULL | |
+-----------+---------+------+-----+---------+-------+
4 rows in set (0.01 sec)
select emp_no,from_date,last_date,datediff(from_date,last_date) as diff
from
(
select emp_no,
from_date,
lag(from_date,1) over w as last_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 10;
+--------+------------+------------+------+
| emp_no | from_date | last_date | diff |
+--------+------------+------------+------+
| 10001 | 1986-06-26 | NULL | NULL |
| 10001 | 1987-06-26 | 1986-06-26 | 365 |
| 10001 | 1988-06-25 | 1987-06-26 | 365 |
| 10001 | 1989-06-25 | 1988-06-25 | 365 |
| 10001 | 1990-06-25 | 1989-06-25 | 365 |
| 10001 | 1991-06-25 | 1990-06-25 | 365 |
| 10001 | 1992-06-24 | 1991-06-25 | 365 |
| 10001 | 1993-06-24 | 1992-06-24 | 365 |
| 10001 | 1994-06-24 | 1993-06-24 | 365 |
| 10001 | 1995-06-24 | 1994-06-24 | 365 |
+--------+------------+------------+------+
10 rows in set (6.49 sec)
select emp_no,from_date,last_date,datediff(from_date,last_date) as diff
from
(
select emp_no,
from_date,
lead(from_date,1) over w as last_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 10;
+--------+------------+------------+------+
| emp_no | from_date | last_date | diff |
+--------+------------+------------+------+
| 10001 | 1986-06-26 | 1987-06-26 | -365 |
| 10001 | 1987-06-26 | 1988-06-25 | -365 |
| 10001 | 1988-06-25 | 1989-06-25 | -365 |
| 10001 | 1989-06-25 | 1990-06-25 | -365 |
| 10001 | 1990-06-25 | 1991-06-25 | -365 |
| 10001 | 1991-06-25 | 1992-06-24 | -365 |
| 10001 | 1992-06-24 | 1993-06-24 | -365 |
| 10001 | 1993-06-24 | 1994-06-24 | -365 |
| 10001 | 1994-06-24 | 1995-06-24 | -365 |
| 10001 | 1995-06-24 | 1996-06-23 | -365 |
+--------+------------+------------+------+
10 rows in set (7.20 sec)
10.FIRST_VALUE (expr) /LAST_VALUE (expr)
FIRST_VALUE (expr) 函数和 LAST_VALUE (expr) 函数的功能分别是获得滑动窗口范围内的参数字段中第一个 (FIRST_VALUE) 和最后一个 (LAST_VALUE) 的值。SQL 语句如下:
select *
from
(
select emp_no,
salary,
from_date,
first_value(salary) over w as first_salary,
last_value(salary) over w as last_salary,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 10;
+--------+--------+------------+--------------+-------------+------------+
| emp_no | salary | from_date | first_salary | last_salary | to_date |
+--------+--------+------------+--------------+-------------+------------+
| 10001 | 60117 | 1986-06-26 | 60117 | 60117 | 1987-06-26 |
| 10001 | 62102 | 1987-06-26 | 60117 | 62102 | 1988-06-25 |
| 10001 | 66074 | 1988-06-25 | 60117 | 66074 | 1989-06-25 |
| 10001 | 66596 | 1989-06-25 | 60117 | 66596 | 1990-06-25 |
| 10001 | 66961 | 1990-06-25 | 60117 | 66961 | 1991-06-25 |
| 10001 | 71046 | 1991-06-25 | 60117 | 71046 | 1992-06-24 |
| 10001 | 74333 | 1992-06-24 | 60117 | 74333 | 1993-06-24 |
| 10001 | 75286 | 1993-06-24 | 60117 | 75286 | 1994-06-24 |
| 10001 | 75994 | 1994-06-24 | 60117 | 75994 | 1995-06-24 |
| 10001 | 76884 | 1995-06-24 | 60117 | 76884 | 1996-06-23 |
+--------+--------+------------+--------------+-------------+------------+
10 rows in set (5.40 sec)
11.聚合函数作为窗口函数
除了前面介绍的各类窗口函数外,我们经常使用的各种聚合函数(SUM/AVG/MAX/MIN/COUNT)也可以作为窗口函数来使用。如下统计员工年收入总和,平均年收入,最大年收入,最小年收入,可以用聚合函数作为窗口函数实现如下:
select * from
(
select emp_no,
salary,
sum(salary) over w as sum1,
avg(salary) over w as avg1,
max(salary) over w as max1,
min(salary) over w as min1,
count(salary) over w as count1,
from_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 10;
+--------+--------+--------+------------+-------+-------+--------+------------+------------+
| emp_no | salary | sum1 | avg1 | max1 | min1 | count1 | from_date | to_date |
+--------+--------+--------+------------+-------+-------+--------+------------+------------+
| 10001 | 60117 | 60117 | 60117.0000 | 60117 | 60117 | 1 | 1986-06-26 | 1987-06-26 |
| 10001 | 62102 | 122219 | 61109.5000 | 62102 | 60117 | 2 | 1987-06-26 | 1988-06-25 |
| 10001 | 66074 | 188293 | 62764.3333 | 66074 | 60117 | 3 | 1988-06-25 | 1989-06-25 |
| 10001 | 66596 | 254889 | 63722.2500 | 66596 | 60117 | 4 | 1989-06-25 | 1990-06-25 |
| 10001 | 66961 | 321850 | 64370.0000 | 66961 | 60117 | 5 | 1990-06-25 | 1991-06-25 |
| 10001 | 71046 | 392896 | 65482.6667 | 71046 | 60117 | 6 | 1991-06-25 | 1992-06-24 |
| 10001 | 74333 | 467229 | 66747.0000 | 74333 | 60117 | 7 | 1992-06-24 | 1993-06-24 |
| 10001 | 75286 | 542515 | 67814.3750 | 75286 | 60117 | 8 | 1993-06-24 | 1994-06-24 |
| 10001 | 75994 | 618509 | 68723.2222 | 75994 | 60117 | 9 | 1994-06-24 | 1995-06-24 |
| 10001 | 76884 | 695393 | 69539.3000 | 76884 | 60117 | 10 | 1995-06-24 | 1996-06-23 |
+--------+--------+--------+------------+-------+-------+--------+------------+------------+
10 rows in set (24.32 sec)
12.NFILE(N)
NFILE()函数的功能是对一个数据分区中的有序结果集进行划分,将其分为N个组,并为每个小组分配一个唯一的组编号。对员工工资进行分组,NFILE()函数记录每组组编号,SQL 代码如下:
select * from
(
select ntile(3) over w as nf,
emp_no,
salary,
from_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 20;
+------+--------+--------+------------+------------+
| nf | emp_no | salary | from_date | to_date |
+------+--------+--------+------------+------------+
| 1 | 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 1 | 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 1 | 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 1 | 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 1 | 10001 | 66961 | 1990-06-25 | 1991-06-25 |
| 1 | 10001 | 71046 | 1991-06-25 | 1992-06-24 |
| 2 | 10001 | 74333 | 1992-06-24 | 1993-06-24 |
| 2 | 10001 | 75286 | 1993-06-24 | 1994-06-24 |
| 2 | 10001 | 75994 | 1994-06-24 | 1995-06-24 |
| 2 | 10001 | 76884 | 1995-06-24 | 1996-06-23 |
| 2 | 10001 | 80013 | 1996-06-23 | 1997-06-23 |
| 2 | 10001 | 81025 | 1997-06-23 | 1998-06-23 |
| 3 | 10001 | 81097 | 1998-06-23 | 1999-06-23 |
| 3 | 10001 | 84917 | 1999-06-23 | 2000-06-22 |
| 3 | 10001 | 85112 | 2000-06-22 | 2001-06-22 |
| 3 | 10001 | 85097 | 2001-06-22 | 2002-06-22 |
| 3 | 10001 | 88958 | 2002-06-22 | 9999-01-01 |
| 1 | 10002 | 65828 | 1996-08-03 | 1997-08-03 |
| 1 | 10002 | 65909 | 1997-08-03 | 1998-08-03 |
| 2 | 10002 | 67534 | 1998-08-03 | 1999-08-03 |
+------+--------+--------+------------+------------+
20 rows in set (6.06 sec)
13.PERCENT_RANK()/CUME_DIST()
PERCENT_RANK() 和 CUME_DIST() 这两个函数都是计算数据分布的函数。PERCENT_RANK()和之前的 RANK()函数相关,每行按照以下公式进行计算:
(rank - 1) / (rows - 1) 其中,rank 为 RANK()函数产生的序号,rows 为当前窗口的记录总行数。上面的例子修改如下:
select * from
(
select rank() over w as row_num,
percent_rank() over w as percent,
emp_no,
salary,
from_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 20;
+---------+---------+--------+--------+------------+------------+
| row_num | percent | emp_no | salary | from_date | to_date |
+---------+---------+--------+--------+------------+------------+
| 1 | 0 | 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 2 | 0.0625 | 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 3 | 0.125 | 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 4 | 0.1875 | 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 5 | 0.25 | 10001 | 66961 | 1990-06-25 | 1991-06-25 |
| 6 | 0.3125 | 10001 | 71046 | 1991-06-25 | 1992-06-24 |
| 7 | 0.375 | 10001 | 74333 | 1992-06-24 | 1993-06-24 |
| 8 | 0.4375 | 10001 | 75286 | 1993-06-24 | 1994-06-24 |
| 9 | 0.5 | 10001 | 75994 | 1994-06-24 | 1995-06-24 |
| 10 | 0.5625 | 10001 | 76884 | 1995-06-24 | 1996-06-23 |
| 11 | 0.625 | 10001 | 80013 | 1996-06-23 | 1997-06-23 |
| 12 | 0.6875 | 10001 | 81025 | 1997-06-23 | 1998-06-23 |
| 13 | 0.75 | 10001 | 81097 | 1998-06-23 | 1999-06-23 |
| 14 | 0.8125 | 10001 | 84917 | 1999-06-23 | 2000-06-22 |
| 15 | 0.875 | 10001 | 85112 | 2000-06-22 | 2001-06-22 |
| 16 | 0.9375 | 10001 | 85097 | 2001-06-22 | 2002-06-22 |
| 17 | 1 | 10001 | 88958 | 2002-06-22 | 9999-01-01 |
| 1 | 0 | 10002 | 65828 | 1996-08-03 | 1997-08-03 |
| 2 | 0.2 | 10002 | 65909 | 1997-08-03 | 1998-08-03 |
| 3 | 0.4 | 10002 | 67534 | 1998-08-03 | 1999-08-03 |
+---------+---------+--------+--------+------------+------------+
20 rows in set (5.40 sec)
rank的值取row_num列的值,rows的值为(emp_no的值为10001有17行,emp_no的值为10002有3行),percent列的值按(rank - 1) / (rows - 1)计算得出。
14.CUME_DIST函数的作用是分组内小于等于当前rank值的行数/分组内总行数
select * from
(
select rank() over w as row_num,
cume_dist() over w as cume,
emp_no,
salary,
from_date,
to_date
from salaries
WINDOW w AS (partition by emp_no order by from_date asc)
) t limit 25;
+---------+----------------------+--------+--------+------------+------------+
| row_num | cume | emp_no | salary | from_date | to_date |
+---------+----------------------+--------+--------+------------+------------+
| 1 | 0.058823529411764705 | 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 2 | 0.11764705882352941 | 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 3 | 0.17647058823529413 | 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 4 | 0.23529411764705882 | 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 5 | 0.29411764705882354 | 10001 | 66961 | 1990-06-25 | 1991-06-25 |
| 6 | 0.35294117647058826 | 10001 | 71046 | 1991-06-25 | 1992-06-24 |
| 7 | 0.4117647058823529 | 10001 | 74333 | 1992-06-24 | 1993-06-24 |
| 8 | 0.47058823529411764 | 10001 | 75286 | 1993-06-24 | 1994-06-24 |
| 9 | 0.5294117647058824 | 10001 | 75994 | 1994-06-24 | 1995-06-24 |
| 10 | 0.5882352941176471 | 10001 | 76884 | 1995-06-24 | 1996-06-23 |
| 11 | 0.6470588235294118 | 10001 | 80013 | 1996-06-23 | 1997-06-23 |
| 12 | 0.7058823529411765 | 10001 | 81025 | 1997-06-23 | 1998-06-23 |
| 13 | 0.7647058823529411 | 10001 | 81097 | 1998-06-23 | 1999-06-23 |
| 14 | 0.8235294117647058 | 10001 | 84917 | 1999-06-23 | 2000-06-22 |
| 15 | 0.8823529411764706 | 10001 | 85112 | 2000-06-22 | 2001-06-22 |
| 16 | 0.9411764705882353 | 10001 | 85097 | 2001-06-22 | 2002-06-22 |
| 17 | 1 | 10001 | 88958 | 2002-06-22 | 9999-01-01 |
| 1 | 0.16666666666666666 | 10002 | 65828 | 1996-08-03 | 1997-08-03 |
| 2 | 0.3333333333333333 | 10002 | 65909 | 1997-08-03 | 1998-08-03 |
| 3 | 0.5 | 10002 | 67534 | 1998-08-03 | 1999-08-03 |
| 4 | 0.6666666666666666 | 10002 | 69366 | 1999-08-03 | 2000-08-02 |
| 5 | 0.8333333333333334 | 10002 | 71963 | 2000-08-02 | 2001-08-02 |
| 6 | 1 | 10002 | 72527 | 2001-08-02 | 9999-01-01 |
| 1 | 0.14285714285714285 | 10003 | 40006 | 1995-12-03 | 1996-12-02 |
| 2 | 0.2857142857142857 | 10003 | 43616 | 1996-12-02 | 1997-12-02 |
+---------+----------------------+--------+--------+------------+------------+
25 rows in set (5.26 sec)
20 rows in set (5.96 sec)
cume的值取row_num列的值,rows的值为(emp_no的值为10001有17行,emp_no的值为10002有6行),cume列的值按(row_num) / (rows)计算得出。
也就是 1/17、2/17.....




