1.前言 又到了年底,最近在做年報數據拉取的需求,在這期間有一些數據的計算需要用到視窗函數,就去瞭解了一下常用的視窗函數的用法。 1.1.什麼是視窗函數 視窗函數是 SQL 中的一種特殊函數,它允許你在查詢結果中的某個視窗或視窗範圍上執行計算,而不是單獨針對整個結果集進行計算。 視窗函數通常與 OV ...
1.前言
又到了年底,最近在做年報數據拉取的需求,在這期間有一些數據的計算需要用到視窗函數,就去瞭解了一下常用的視窗函數的用法。
1.1.什麼是視窗函數
視窗函數是 SQL 中的一種特殊函數,它允許你在查詢結果中的某個視窗或視窗範圍上執行計算,而不是單獨針對整個結果集進行計算。
視窗函數通常與 OVER()
子句結合使用,用於定義視窗的範圍。這個範圍可以是整個結果集、分組的行、特定的物理行數範圍,或者是基於值範圍的視窗。
視窗函數通常在查詢的最後階段執行,即在結果集形成之後對結果集中的行應用視窗函數。視窗函數計算的結果是基於最終結果集的行,而不影響原始數據的分組,也就是說視窗函數的結果會添加到原查詢結果的每一行上,這是個人覺得這是OVEE(PARTITION BY COLUMN_NAME ORDER BY COLUMN_NAME)
與 傳統的 GROUP BY COLUMN ORDER BY COLUMN
差別最大的地方。
2.常用視窗函數
可以使用以下sql進行初始化
CREATE TABLE `video_play` (
`id` int NOT NULL AUTO_INCREMENT,
`user_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT '用戶名',
`video_type` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT '視頻類型',
`duration` int DEFAULT NULL COMMENT '播放時間',
`date` datetime DEFAULT NULL COMMENT '開始播放時間',
PRIMARY KEY (`id`)
);
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (1, 'zhangsan', 'cartoon', 126, '2023-12-20 10:23:23');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (2, 'zhangsan', 'movie', 38, '2023-12-20 12:40:32');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (3, 'zhangsan', 'teleplay', 98, '2023-12-20 13:25:53');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (4, 'zhangsan', 'news', 45, '2023-12-20 15:24:43');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (5, 'lisi', 'cartoon', 83, '2023-12-20 09:28:17');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (6, 'lisi', 'movie', 56, '2023-12-20 12:11:11');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (7, 'lisi', 'teleplay', 79, '2023-12-20 13:56:23');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (8, 'wangwu', 'cartoon', 156, '2023-12-20 08:17:14');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (9, 'wangwu', 'movie', 89, '2023-12-20 09:25:35');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (10, 'wangwu', 'teleplay', 986, '2023-12-20 11:45:11');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (11, 'wangwu', 'news', 124, '2023-12-20 16:56:01');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (12, 'xiaoming', 'cartoon', 123, '2023-12-20 07:08:34');
INSERT INTO `video_play` (`id`, `user_name`, `video_type`, `duration`, `date`) VALUES (13, 'xiaoming', 'movie', 963, '2023-12-20 09:13:23');
2.1.SUM()
用法:SUM(COLUMN_NAME) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME]);
如果存在PARTITION BY COLUMN_NAME, 則代表查詢結果根據COLUMN_NAME進行分組求和。
如果存在ORDER BY COLUMN_NAME, 則代表查詢結果則根據COLUMN_NAME排序後, 按照排序的結果進行累積求和。這裡的累積求和是指在當前視窗內,當前行的值加上前行的值得到的結果。如果不加ORDER BY同一視窗內每一行的值都相同,為當前視窗內所有值的和。
sql舉例:
SELECT
*,
SUM(duration) OVER() AS exp1,
SUM(duration) OVER(PARTITION BY user_name) AS exp2,
SUM(duration) OVER(ORDER BY `date` ASC) AS exp3,
SUM(duration) OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
2.2.COUNT()
用法:COUNT([DISTINCT] COLUMN_NAME) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME])
整體用法跟SUM()函數類似,只是由求和變成了計數,並且支持DISTINCT去重,目前已知的是Mysql不支持。
sql舉例:
SELECT
*,
COUNT(duration) OVER() AS exp1,
COUNT(duration) OVER(PARTITION BY video_type) AS exp2,
COUNT(duration) OVER(ORDER BY duration) AS exp3,
COUNT(duration) OVER(PARTITION BY video_type ORDER BY duration ASC) AS exp4
FROM
video_play;
2.3.AVG()
用法:AVG(COLUMN_NAME) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME]);
整體用法跟SUM()函數類似,只是由求和變成了求平均值。
sql舉例:
SELECT
*,
AVG(duration) OVER() AS exp1,
AVG(duration) OVER(PARTITION BY user_name) AS exp2,
AVG(duration) OVER(ORDER BY `date` ASC) AS exp3,
AVG(duration) OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
2.4.ROW_NUMBER()
用法:ROW_NUMBER() OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME])
這個函數的功能是求結果的行號, 可以使用ORDER BY 進行內部排序之後再去求行號, 行號是連續的不會出現空缺
sql舉例:
SELECT
*,
ROW_NUMBER() OVER() AS exp1,
ROW_NUMBER() OVER(PARTITION BY user_name) AS exp2,
ROW_NUMBER() OVER(ORDER BY `date` ASC) AS exp3,
ROW_NUMBER() OVER(PARTITION BY user_name ORDER BY `date` ASC) AS exp4
FROM
video_play;
這個函數是這次年報中用的比較多的函數,簡化版的需求就是求每個用戶播放時長TOPN的類型及播放時間。實現思路是現根據用戶分組,再使用該函數對時長進行排序並生成行號,最後取每個用戶的前N行數據,下麵是實現的sql及結果
WITH user_video_play_index AS (
SELECT
user_name,
video_type,
duration,
ROW_NUMBER() OVER(PARTITION BY user_name ORDER BY duration DESC, date ASC) AS row_num
FROM
video_play
)
SELECT
user_name,
video_type,
duration,
row_num
FROM
user_video_play_index
WHERE
row_num <= 2
2.5.RANK()和DENSE_RANK()
用法:RANK()/DENSE_RANK() OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME])
這個函數的功能是求排名, 如果不使用ORDER BY指明要排序的列, 值會固定為1。如果排序之後的值相同, RANK值也會相同, 如果是RANK()函數,後面的值會跳過相同數量的間隔,DENSE_RANK()則不會(這次的數據集中的值沒有相同的,沒辦法體現出來這一點)。
sql舉例
SELECT
*,
RANK() OVER() AS exp1,
RANK() OVER(PARTITION BY user_name) AS exp2,
RANK() OVER(ORDER BY duration ASC) AS exp3,
RANK() OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp4,
DENSE_RANK() OVER() AS exp5,
DENSE_RANK() OVER(PARTITION BY user_name) AS exp6,
DENSE_RANK() OVER(ORDER BY duration ASC) AS exp7,
DENSE_RANK() OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp8
FROM
video_play
2.6.FIRST_VALUE()和LAST_VALUE()
用法:FIRST_VALUE(COLUMN_NAME)/LAST_VALUE(COLUMN_NAME) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME]);
這兩個函數是用來求分組和排序之後的第一個值和最後一個值, LAST_VALUE會隨著視窗大小的增加而變化, 所以如果進行排序一般是當前行的值。
sql舉例:
SELECT
*,
FIRST_VALUE(duration) OVER() AS exp1,
FIRST_VALUE(duration) OVER(PARTITION BY user_name) AS exp2,
FIRST_VALUE(duration) OVER(ORDER BY duration ASC) AS exp3,
FIRST_VALUE(duration) OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp4,
LAST_VALUE(duration) OVER() AS exp5,
LAST_VALUE(duration) OVER(PARTITION BY user_name) AS exp6,
LAST_VALUE(duration) OVER(ORDER BY duration ASC) AS exp7,
LAST_VALUE(duration) OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp8
FROM
video_play
2.7.LAG()和LEAD()
用法:LAG(COLUMN_NAME, MOVE_LINE_NUM, DEFAULT_VALUE)/LEAD(COLUMN_NAME, MOVE_LINE_NUM, DEFAULT_VALUE) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME]);
LAG和LEAD是分別用來向前或向後取值的函數; 函數裡面三個參數含義是從當前行向前或向後推MOVE_LINE_NUM
行, 然後去取該行的COLUMN_NAME
欄位, 如果向前或向後移動失敗(一般是超過視窗的頭部或超過視窗的尾部)則去取DEFAULT_VALUE
的值。
sql舉例:
SELECT
*,
LAG(duration, 2, 0) OVER() AS exp1,
LAG(duration, 2, 0) OVER(PARTITION BY user_name) AS exp2,
LAG(duration, 2, 0) OVER(ORDER BY duration ASC) AS exp3,
LAG(duration, 2, 0) OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp4,
LEAD(duration, 2, 0) OVER() AS exp5,
LEAD(duration, 2, 0) OVER(PARTITION BY user_name) AS exp6,
LEAD(duration, 2, 0) OVER(ORDER BY duration ASC) AS exp7,
LEAD(duration, 2, 0) OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp8
FROM
video_play
2.8.NTILE()
用法:NTILE(PART_NUM) OVER([PARTITION BY COLUMN_NAME] [ORDER BY COLUMN_NAME]);
NTILE函數是用來對數據進行分組的, 它會儘可能的把數據平均分為PART_NUM組,結果為當前行的組號。如果遇到無法均分的情況,mysql的邏輯時後面的組會比前面的組少1行數據,其他的資料庫暫時沒有測試。
sql舉例:
SELECT
*,
NTILE(2) OVER() AS exp1,
NTILE(2) OVER(PARTITION BY user_name) AS exp2,
NTILE(2) OVER(ORDER BY duration ASC) AS exp3,
NTILE(2) OVER(PARTITION BY user_name ORDER BY duration ASC) AS exp4
FROM
video_play
3.others
當使用視窗函數時,ROWS
、RANGE
和 GROUPS
是用於定義視窗框架的不同方式,它們用於指定視窗函數操作中所考慮的數據範圍。
-
ROWS
-
SELECT id, value, SUM(value) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_values FROM your_table;
-
在這個例子中,
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
指定了視窗函數SUM()
的範圍,在當前行的前一行到後一行(共三行)的範圍內計算value
的總和。 -
ROWS
是基於物理行數來定義視窗範圍,獨立於分組邏輯。
-
-
RANGE
-
SELECT date_column, value, SUM(value) OVER (ORDER BY date_column RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND INTERVAL 3 DAY FOLLOWING) AS sum_values FROM your_table;
-
在這個例子中,
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND INTERVAL 3 DAY FOLLOWING
指定了視窗函數SUM()
的範圍,以當前行日期為中心,計算前七天到後三天內的value
總和。
-
-
GROUPS
-
SELECT id, value, SUM(value) OVER (PARTITION BY category ORDER BY id GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_values FROM your_table;
-
在這個例子中,
GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
指定了視窗函數SUM()
的範圍,以當前行所屬的category
分組為基礎,在每個分組內的前一行到後一行的範圍內計算value
的總和。這些不同的視窗框架定義方式允許在視窗函數中靈活地選擇不同的數據範圍進行計算和分析。
-
GROUPS
根據分組條件定義視窗範圍,對應於分組的邏輯;
-
github原文鏈接:https://github.com/YS1109/Note/blob/master/Mysql/window_func/window_func.md