窗口函数

MySQL8 窗口函数是一种特殊的函数，它可以在一组查询行上执行类似于聚合的操作，但是不会将查询行折叠为单个输出行，而是为每个查询行生成一个结果。窗口函数可以用来处理复杂的报表统计分析场景，例如计算移动平均值、累计和、排名等。它展现的主要威力在于「它能够让我们在不修改原有语句输出结果的基础上，直接添加新的聚合字段」。

语法解析

window_function_name ( [argument1, argument2, ...] ) 
OVER  (
  [ PARTITION BY col1, col2, ... ]
  [ORDER BY col3, col4, ...]
  [ ROWS | RANGE frame_start AND frame_end ]
)

window_function_name

window_function_name 函数可以是聚合函数或者非聚合函数。MySQL8 支持以下几类窗口函数，

序号函数：用于为窗口内的每一行生成一个序号，例如 ROW_NUMBER()，RANK()，DENSE_RANK() 等。
分布函数：用于计算窗口内的每一行在整个分区中的相对位置，例如 PERCENT_RANK()，CUME_DIST() 等。
前后函数：用于获取窗口内的当前行的前后某一行的值，例如 LAG()，LEAD() 等。
头尾函数：用于获取窗口内的第一行或最后一行的值，例如 FIRST_VALUE()，LAST_VALUE() 等。
聚合函数：用于计算窗口内的某个字段的聚合值，例如 SUM()，AVG()，MIN()，MAX() 等。

OVER

OVER 关键字很重要，用来标识是否使用窗口函数，语法如下

over_clause:
    {OVER (window_spec) | OVER window_name}

两种形式都定义了窗口函数应该如何处理查询行。它们的区别在于窗口是直接在 OVER() 中定义，还是基于 window_name 在 OVER 字句可以重复使用。

OVER() 常规用法，窗口规范直接出现在 OVER 子句中的括号之间。
OVER window_name 基于 Named Windows，是由查询中其他地方的 WINDOW 子句定义的窗口规范的名称，可以重复使用。

PARTITION BY

PARTITION BY子句用来将查询结果划分为不同的分区，窗口函数在每个分区上分别执行，语法如下

partition_clause:
    PARTITION BY expr [, expr] ..

ORDER BY

order_clause:
    ORDER BY expr [ASC|DESC] [, expr [ASC|DESC]] ...

frame_clause

frame_clause 是窗口函数的一个可选子句，用来指定每个分区内的数据范围，可以是静态的或动态的。语法如下

frame_clause:
    frame_units frame_extent

frame_units:
    {ROWS | RANGE}

frame_units表示窗口范围的单位，可以是ROWS或RANGE。ROWS表示基于行数，RANGE表示基于值的大小。

frame_extent表示窗口范围的起始位置和结束位置，可以是以下几种形式：

CURRENT ROW: 表示当前行。
UNBOUNDED PRECEDING: 表示分区中的第一行。
UNBOUNDED FOLLOWING: 表示分区中的最后一行。
expr PRECEDING: 表示当前行减去expr的值。
expr FOLLOWING: 表示当前行加上expr的值。

Named Windows

MySQL8的 Named Windows 是指在 WINDOW 子句中定义并命名的窗口，可以在 OVER 子句中通过窗口名来引用。使用 Named Windows 的好处是可以避免在多个OVER子句中重复定义相同的窗口，而只需要在 WINDOW 子句中定义一次，然后在 OVER 子句中引用即可。

SELECT
  val,
  ROW_NUMBER () OVER (ORDER BY val) AS 'row_number',
  RANK () OVER (ORDER BY val) AS 'rank',
  DENSE_RANK () OVER (ORDER BY val) AS 'dense_rank'
FROM numbers;

可以使用Named Windows来简化为：

SELECT
  val,
  ROW_NUMBER () OVER w AS 'row_number',
  RANK () OVER w AS 'rank',
  DENSE_RANK () OVER w AS 'dense_rank'
FROM numbers WINDOW w AS (ORDER BY val);

如果一个 OVER 子句使用了 OVER (window_name …) 而不是 OVER window_name，则可以在引用的窗口名后面添加其他子句来修改窗口。

SELECT
  DISTINCT year, country,
  FIRST_VALUE (year) OVER (w ORDER BY year ASC) AS first,
  FIRST_VALUE (year) OVER (w ORDER BY year DESC) AS last
FROM sales WINDOW w AS (PARTITION BY country);

- 可以根据不同的排序来获取每个国家的第一年和最后一年。

一个命名窗口的定义本身也可以以一个窗口名开头。这样可以实现窗口之间的引用，但不能形成循环。

SELECT
  val,
  SUM(val) OVER w1 AS sum_w1,
  SUM(val) OVER w2 AS sum_w2,
  SUM(val) OVER w3 AS sum_w3
FROM numbers
WINDOW
  w1 AS (ORDER BY val),
  w2 AS (w1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
  w3 AS (w2 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);

例题

CREATE TABLE `sales` (
  `id` int NOT NULL,
  `year` int DEFAULT NULL,
  `country` varchar(20) DEFAULT NULL,
  `product` varchar(20) DEFAULT NULL,
  `profit` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (1, 2000, 'Finland', 'Computer', 1500);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (2, 2000, 'Finland', 'Phone', 100);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (3, 2001, 'Finland', 'Phone', 10);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (4, 2001, 'India', 'Calculator', 75);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (5, 2000, 'India', 'Calculator', 75);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (6, 2000, 'India', 'Computer', 1200);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (7, 2000, 'USA', 'Calculator', 75);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (8, 2000, 'USA', 'Computer', 1500);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (9, 2001, 'USA', 'Calculator', 50);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (12, 2002, 'USA', 'Computer', 1200);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (13, 2001, 'USA', 'TV', 150);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (14, 2002, 'USA', 'TV', 100);
INSERT INTO `test_db`.`sales` (`id`, `year`, `country`, `product`, `profit`) VALUES (15, 2001, 'USA', 'Computer', 1500);

问题一

计算每个国家每年的总利润，并按照国家和年份排序

不使用窗口函数

SELECT year, country, SUM(profit) AS sum_profit
FROM sales
GROUP BY country, year
ORDER BY country, year;

+------+---------+------------+
| year | country | sum_profit |
+------+---------+------------+
| 2000 | Finland |       1600 |
| 2001 | Finland |         10 |
| 2000 | India   |       1275 |
| 2001 | India   |         75 |
| 2000 | USA     |       1575 |
| 2001 | USA     |       1700 |
| 2002 | USA     |       1300 |
+------+---------+------------+

使用窗口函数

SELECT year, country, SUM(profit) OVER (PARTITION BY country, year) AS sum_profit
FROM sales
ORDER BY country, year

+------+---------+------------+
| year | country | sum_profit |
+------+---------+------------+
| 2000 | Finland |       1600 |
| 2000 | Finland |       1600 |
| 2001 | Finland |         10 |
| 2000 | India   |       1275 |
| 2000 | India   |       1275 |
| 2001 | India   |         75 |
| 2000 | USA     |       1575 |
| 2000 | USA     |       1575 |
| 2001 | USA     |       1700 |
| 2001 | USA     |       1700 |
| 2001 | USA     |       1700 |
| 2002 | USA     |       1300 |
| 2002 | USA     |       1300 |
+------+---------+------------+

每个国家每年的总利润都被计算出来了，但是没有折叠为单个输出行，而是为每个查询行生成了一个结果。

问题二

计算每个国家每种产品的销售排名，并按照国家和排名排序

不使用窗口函数

SELECT country, product, COUNT(1) AS sell_count
FROM sales
GROUP BY country, product
ORDER BY country, sell_count;

+---------+------------+------------+
| country | product    | sell_count |
+---------+------------+------------+
| Finland | Computer   |          1 |
| Finland | Phone      |          2 |
| India   | Computer   |          1 |
| India   | Calculator |          2 |
| USA     | Calculator |          2 |
| USA     | TV         |          2 |
| USA     | Computer   |          3 |
+---------+------------+------------+

这里，不使用窗口函数只能根据sell_count来进行排序，无法单独加上一列rank，如果必须要单独加上一列rank，需要涉及到子查询 + 连接。是一个比较复杂的查询操作

使用窗口函数

SELECT country, product, COUNT(1) OVER (PARTITION BY country, product) AS sell_count
FROM sales
ORDER BY country, sell_count

+---------+------------+------------+
| country | product    | sell_count |
+---------+------------+------------+
| Finland | Computer   |          1 |
| Finland | Phone      |          2 |
| Finland | Phone      |          2 |
| India   | Computer   |          1 |
| India   | Calculator |          2 |
| India   | Calculator |          2 |
| USA     | Calculator |          2 |
| USA     | Calculator |          2 |
| USA     | TV         |          2 |
| USA     | TV         |          2 |
| USA     | Computer   |          3 |
| USA     | Computer   |          3 |
| USA     | Computer   |          3 |
+---------+------------+------------+

这里是使用窗口函数来实现根据sell_count排名，可以发现，这样使用窗口函数其实就是省下了GROUP BY分组

同时还能发现的是，如果直接使用GROUP BY，那么对于那些被分出来的小表中任然有超过一条的数据的来说，是无法输出的。

Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_db.sales.profit' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

使用窗口函数来优雅的解决

SELECT country, product, profit, RANK() OVER (PARTITION BY country ORDER BY profit DESC ) AS rank1
FROM sales
ORDER BY country, rank1;

+---------+------------+--------+-------+
| country | product    | profit | rank1 |
+---------+------------+--------+-------+
| Finland | Computer   |   1500 |     1 |
| Finland | Phone      |    100 |     2 |
| Finland | Phone      |     10 |     3 |
| India   | Computer   |   1200 |     1 |
| India   | Calculator |     75 |     2 |
| India   | Calculator |     75 |     2 |
| USA     | Computer   |   1500 |     1 |
| USA     | Computer   |   1500 |     1 |
| USA     | Computer   |   1200 |     3 |
| USA     | TV         |    150 |     4 |
| USA     | TV         |    100 |     5 |
| USA     | Calculator |     75 |     6 |
| USA     | Calculator |     50 |     7 |
+---------+------------+--------+-------+

这里我对窗口函数运行的理解如下，首先MYSQL 会忽略窗口函数，直接查询出结果，之后窗口函数会对原始查询出的结果按照 PARTITION BY 进行分组，按照 ORDER BY 进行排序，之后对于分好组的数据，运行 windows_function，得到结果，并将结果附加到原始数据的新列。

因此，窗口函数是在已经生成的查询结果基础上进行的“二次处理”，并不会影响到查询的其他部分。

问题三

计算每个国家每种产品的累计利润，并按照国家和利润排序

SELECT country, product, profit,
SUM(profit) OVER (PARTITION BY country ORDER BY profit
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_profit
FROM sales
ORDER BY country, profit;

+---------+------------+--------+-------------------+
| country | product    | profit | cumulative_profit |
+---------+------------+--------+-------------------+
| Finland | Phone      |     10 |                10 |
| Finland | Phone      |    100 |               110 |
| Finland | Computer   |   1500 |              1610 |
| India   | Calculator |     75 |                75 |
| India   | Calculator |     75 |               150 |
| India   | Computer   |   1200 |              1350 |
| USA     | Calculator |     50 |                50 |
| USA     | Calculator |     75 |               125 |
| USA     | TV         |    100 |               225 |
| USA     | TV         |    150 |               375 |
| USA     | Computer   |   1200 |              1575 |
| USA     | Computer   |   1500 |              3075 |
| USA     | Computer   |   1500 |              4575 |
+---------+------------+--------+-------------------+

感觉窗口函数是真的厉害，这种查询我之前是没有任何思路的

问题四

基于Named Window 重写问题三

SELECT country, product, profit, 
SUM(profit) OVER w1 AS cumulative_profit
FROM sales
WINDOW
w1 as (PARTITION BY country ORDER BY profit 
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY country, profit
;

常用的窗口函数总结

函数名	用途
`ROW_NUMBER()`	为查询结果中的每一行分配唯一的递增整数。
`RANK()`	对结果集进行排名，相同值的行会共享同一排名，并且跳过排名。
`DENSE_RANK()`	对结果集进行排名，相同值的行会共享同一排名，但不会跳过排名。
`SUM()`	计算指定窗口内列的和。
`MIN()`	返回指定窗口内列的最小值。
`MAX()`	返回指定窗口内列的最大值。
`AVG()`	计算指定窗口内列的平均值。
`LAG()`	返回当前行的前一行的值。
`LAG(N)`	返回当前行的前 `N` 行的值，`N` 为指定的行数。
`LAG(col, n, DEFAULT)`	返回当前行的前 `n` 行的 `col` 列的值，如果不存在前 `n` 行，则返回默认值 `DEFAULT`。
`LEAD()`	返回当前行的后一行的值。
`LEAD(N)`	返回当前行的后 `N` 行的值，`N` 为指定的行数。
`LEAD(col, n, DEFAULT)`	返回当前行的后 `n` 行的 `col` 列的值，如果不存在后 `n` 行，则返回默认值 `DEFAULT`。
`FIRST_VALUE()`	返回指定窗口或分区内的第一个值。
`LAST_VALUE()`	返回指定窗口或分区内的最后一个值。
`NTILE()`	将结果集划分为n个桶，并将每一行分配到其中一个桶中。
`PERCENT_RANK()`	计算当前行在分区内的百分比排名，结果介于 `0` 和 `1` 之间。
`CUME_DIST()`	计算当前行及其之前的行在分区内的累积分布百分比，结果介于 `0` 和 `1` 之间。