Fan-out joins are a common way to derive specific aggregates during development, but using them means accepting the risk of data explosion. Through a typical case, this article presents a tuning approach for "finding the minimum of all values greater than the current row's value".
This article is shared from the Huawei Cloud community post "GaussDB(DWS) Performance Tuning: Finding the Minimum of a Column's Values Greater Than the Current Row's Value — Analysis and Rewrite of a Data-Explosion Case Caused by Multiple Fan-out Joins", by Zawami.
1. [Problem Description]
The statement joins the same table to itself several times, every one of these self-joins fans out, and the resulting data explosion is the performance bottleneck.
2. [Original SQL]
explain verbose
WITH TMP AS (
    SELECT WH_ID
         , (IFNULL(SUBSTR(THE_DATE,1,10),'1900-01-01') || ' ' || STOP_TIME)::TIMESTAMP AS STOP_TIME
         , (IFNULL(SUBSTR(THE_DATE,1,10),'1900-01-01') || ' ' || '23:59:59')::TIMESTAMP AS MAX_ASD
      FROM DMISC.DM_DIM_CBG_WH_HOLIDAY_D
     WHERE IS_OPEN = 'Y'
       AND STOP_TIME IS NOT NULL
)
SELECT T1.WH_ID
     , T1.THE_DATE
     , T1.IS_OPEN
     , MIN(T2.STOP_TIME) AS STOP_TIME
     , MIN(T2.MAX_ASD)   AS TODAY_MAX_ASD
     , MIN(T3.MAX_ASD)   AS NEXT_MAX_ASD
  FROM (SELECT WH_ID
             , THE_DATE
             , IS_OPEN
             , (IFNULL(SUBSTR(THE_DATE,1,10),'1900-01-01') || ' ' || STOP_TIME)::TIMESTAMP AS STOP_TIME
          FROM DMISC.DM_DIM_CBG_WH_HOLIDAY_D
       ) T1
  LEFT JOIN TMP T2
    ON T1.WH_ID = T2.WH_ID
   AND T1.THE_DATE < T2.STOP_TIME
  LEFT JOIN TMP T3
    ON T1.WH_ID = T3.WH_ID
   AND ADDDATE(T1.THE_DATE,1) < T3.STOP_TIME
 GROUP BY T1.WH_ID, T1.THE_DATE, T1.IS_OPEN;
From the SQL it is easy to see that the physical table HOLIDAY_D uses WH_ID as the join key and uses other columns in inequality join conditions.
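Why inequality self-joins explode is easy to see on toy data. The sketch below (hypothetical data, not the author's table) mimics the shape of the SQL above for a single warehouse: each row matches every later stop time, and the second fan-out join multiplies the already-fanned-out rows again.

```python
# Toy illustration (hypothetical data, not the author's table): why two
# inequality self-joins make row counts explode.
rows = [("WH1", d) for d in range(100)]   # (wh_id, the_date) for one warehouse
tmp = [("WH1", s) for s in range(100)]    # (wh_id, stop_time), same join key

# First fan-out join: every T1 row matches every later stop_time on the key.
join1 = [(d, s) for (w1, d) in rows for (w2, s) in tmp if w1 == w2 and d < s]

# Second fan-out join: each already-fanned-out row fans out again.
join2 = [(d, s, s3) for (d, s) in join1
         for (w3, s3) in tmp if w3 == "WH1" and d + 1 < s3]

print(len(rows), len(join1), len(join2))  # 100 rows become 4950, then 323400
```

One hundred rows become roughly 5 thousand after one fan-out join and over 300 thousand after two, which is the same quadratic-then-cubic growth that drives the trillion-row estimate in the plan below.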
3. [Performance Analysis]
QUERY PLAN
 id |                                   operation                                  |    E-rows     | E-distinct |   E-memory    | E-width |     E-costs
----+------------------------------------------------------------------------------+---------------+------------+---------------+---------+-----------------
  1 | ->  Row Adapter                                                              |         51584 |            |               |      67 | 377559930171.36
  2 |    ->  Vector Streaming (type: GATHER)                                       |         51584 |            |               |      67 | 377559930171.36
  3 |       ->  Vector Hash Aggregate                                              |         51584 |            | 16MB          |      67 | 377559929546.36
  4 |          ->  Vector CTE Append(5, 7)                                         | 5699739636332 |            | 1MB           |      43 | 292063834485.54
  5 |             ->  Vector Streaming(type: BROADCAST)                            |        757752 |            | 2MB           |      22 | 1474.87
  6 |                ->  CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d [5, CTE tmp(1)] |     757752 |            | 1MB           |      22 | 1474.87
  7 |             ->  Vector Hash Left Join (8, 11)                                | 5699739636332 |            | 107MB(6863MB) |      43 | 292063833010.67
  8 |                ->  Vector Hash Right Join (9, 10)                            |     542231841 | 50         | 16MB          |      27 | 22365789.31
  9 |                   ->  Vector CTE Scan on tmp(1) t3                           |         31573 | 50         | 1MB           |      48 | 15155.04
 10 |                   ->  CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d           |         51584 | 50         | 1MB           |      19 | 556.58
 11 |                ->  Vector CTE Scan on tmp(1) t2                              |         31573 | 50         | 1MB           |      48 | 15155.04
Because the SQL is very slow, producing a performance plan is impractical, so we look at the verbose plan first. It shows that after the two fan-out joins the estimated row count reaches roughly 5.7 trillion; because the hash joins are keyed on the WH_ID column, the real count will not be that large, but the tuning direction is clear: remove some of the fan-out so that the intermediate result set shrinks.
4. [Rewritten SQL]
Analyzing the SQL shows that the fan-out exists only to find, among all STOP_TIME values, the smallest one greater than the current row's THE_DATE. Logic like this, where every row needs both its own value and the whole data set, might be expressible with a window function; limited by the author's ability, a single-self-join solution is given here instead.
The SQL is rewritten as follows:
explain performance
WITH TMP AS (
    SELECT WH_ID
         , (IFNULL(SUBSTR(THE_DATE,1,10),'1900-01-01') || ' ' || STOP_TIME)::TIMESTAMP AS STOP_TIME
         , (IFNULL(SUBSTR(THE_DATE,1,10),'1900-01-01') || ' ' || '23:59:59')::TIMESTAMP AS MAX_ASD
      FROM DMISC.DM_DIM_CBG_WH_HOLIDAY_D
     WHERE IS_OPEN = 'Y'
       AND STOP_TIME IS NOT NULL
)
SELECT T1.WH_ID
     , T1.THE_DATE
     , T1.IS_OPEN
     , MIN(CASE WHEN T1.THE_DATE < T2.STOP_TIME THEN STOP_TIME ELSE NULL END) AS STOP_TIME
     , MIN(CASE WHEN T1.THE_DATE < T2.STOP_TIME THEN T2.MAX_ASD ELSE NULL END) AS TODAY_MAX_ASD
     , MIN(CASE WHEN ADDDATE(T1.THE_DATE, 1) < T2.STOP_TIME THEN T2.MAX_ASD ELSE NULL END) AS NEXT_MAX_ASD
  FROM (SELECT DISTINCT WH_ID
             , THE_DATE
             , IS_OPEN
          FROM DMISC.DM_DIM_CBG_WH_HOLIDAY_D
       ) T1
  LEFT JOIN TMP T2
    ON T1.WH_ID = T2.WH_ID
 GROUP BY T1.WH_ID
        , T1.THE_DATE
        , T1.IS_OPEN;
The rewrite removes one of the self-joins, so the SQL's intermediate result set shrinks; after the remaining join, conditional aggregation produces the required values.
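That the two forms are equivalent can be checked on toy data. The sketch below (hypothetical values, not the author's table) compares the original shape, where each inequality lives in its own join condition, against the rewritten shape, where a single equality join feeds a MIN over a CASE expression; NULLs produced by the CASE (None here) are simply ignored by MIN.

```python
# Toy check (hypothetical data): one equality join plus conditional
# aggregation gives the same answers as two filtered fan-out joins + MIN().
tmp = {"WH1": [5, 9, 12], "WH2": [3, 7]}     # wh_id -> TMP stop_time values
rows = [("WH1", 4), ("WH1", 8), ("WH2", 6)]  # (wh_id, the_date)

def two_joins(wh, d):
    # Original shape: filter inside each join condition, then aggregate.
    j2 = [s for s in tmp[wh] if d < s]       # ... AND T1.THE_DATE < T2.STOP_TIME
    j3 = [s for s in tmp[wh] if d + 1 < s]   # ... AND ADDDATE(THE_DATE,1) < STOP_TIME
    return min(j2, default=None), min(j3, default=None)

def one_join(wh, d):
    # Rewritten shape: join on WH_ID only, then MIN(CASE WHEN ... ELSE NULL END);
    # rows where the CASE yields NULL are skipped by MIN.
    joined = tmp[wh]                         # output of the equality join
    today = min((s for s in joined if d < s), default=None)
    nxt = min((s for s in joined if d + 1 < s), default=None)
    return today, nxt

for wh, d in rows:
    assert two_joins(wh, d) == one_join(wh, d)
```

The intermediate set of the rewritten form is bounded by one equality join rather than the product of two fan-outs, which is exactly what the plan below reflects.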
 id |                         operation                          |        A-time        |  A-rows  | E-rows | E-distinct |  Peak Memory   | E-memory |  A-width  | E-width | E-costs
----+------------------------------------------------------------+----------------------+----------+--------+------------+----------------+----------+-----------+---------+----------
  1 | ->  Row Adapter                                            | 7490.354             |    34035 |    200 |            | 70KB           |          |           |      58 | 15149.80
  2 |    ->  Vector Streaming (type: GATHER)                     | 7488.129             |    34035 |    200 |            | 216KB          |          |           |      58 | 15149.80
  3 |       ->  Vector Hash Aggregate                            | [7481.430, 7481.430] |    34035 |    200 |            | [9MB, 9MB]     | 16MB     | [112,112] |      58 | 15137.30
  4 |          ->  Vector Hash Left Join (5, 7)                  | [909.377, 909.377]   | 31204164 | 109803 |            | [2MB, 2MB]     | 16MB     |           |      34 | 3880.50
  5 |             ->  Vector Sonic Hash Aggregate                | [5.876, 5.876]       |    34035 |  34036 | 6807       | [3MB, 3MB]     | 16MB     | [51,51]   |      18 | 1127.67
  6 |                ->  CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d | [0.199, 0.199]  |    34036 |  34036 |            | [792KB, 792KB] | 1MB      |           |      18 | 532.04
  7 |             ->  CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d | [40.794, 40.794]  |    25122 |  21960 | 19         | [1MB, 1MB]     | 1MB      | [59,59]   |      24 | 617.13
The execution plan shows that the intermediate result set is now of an acceptable size. However, aggregating the 30-million-row result still takes more than 6 seconds, which is too slow; we need to examine the DN information in the plan to find the cause.
Datanode Information (identified by plan id)
--------------------------------------------------------------------------------
   1 --Row Adapter
         (actual time=7486.498..7490.354 rows=34035 loops=1)
         (CPU: ex c/r=107, ex row=34035, ex cyc=3668104, inc cyc=22468059912)
   2 --Vector Streaming (type: GATHER)
         (actual time=7486.466..7488.129 rows=34035 loops=1)
         (Buffers: shared hit=1)
         (CPU: ex c/r=660037, ex row=34035, ex cyc=22464391808, inc cyc=22464391808)
   3 --Vector Hash Aggregate
         dn_6083_6084 (actual time=7479.644..7481.430 rows=34035 loops=1) (projection time=4488.807)
         dn_6083_6084 (Buffers: shared hit=40)
         dn_6083_6084 (CPU: ex c/r=631, ex row=31204164, ex cyc=19718763112, inc cyc=22443886288)
   4 --Vector Hash Left Join (5, 7)
         dn_6083_6084 (actual time=48.009..909.377 rows=31204164 loops=1)
         dn_6083_6084 (Buffers: shared hit=36)
         dn_6083_6084 (CPU: ex c/r=43699, ex row=59157, ex cyc=2585141400, inc cyc=2725123176)
   5 --Vector Sonic Hash Aggregate
         dn_6083_6084 (actual time=5.177..5.876 rows=34035 loops=1)
         dn_6083_6084 (Buffers: shared hit=11)
         dn_6083_6084 (CPU: ex c/r=500, ex row=34036, ex cyc=17027544, inc cyc=17619064)
   6 --CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d
         dn_6083_6084 (actual time=0.043..0.199 rows=34036 loops=1)
         (CU ScanInfo: smallCu: 0, totalCu: 1, avrCuRow: 34036, totalDeadRows: 0)
         dn_6083_6084 (Buffers: shared hit=11)
         dn_6083_6084 (CPU: ex c/r=17, ex row=34036, ex cyc=591520, inc cyc=591520)
   7 --CStore Scan on dmisc.dm_dim_cbg_wh_holiday_d
         dn_6083_6084 (actual time=6.464..40.794 rows=25122 loops=1) (filter time=0.872 projection time=33.671)
         (RoughCheck CU: CUNone: 0, CUTagNone: 0, CUSome: 1)
         (CU ScanInfo: smallCu: 0, totalCu: 1, avrCuRow: 34036, totalDeadRows: 0)
         dn_6083_6084 (Buffers: shared hit=25)
         dn_6083_6084 (CPU: ex c/r=3595, ex row=34036, ex cyc=122362712, inc cyc=122362712)
From this we can see that every operator ran on a single DN. This amounts to severe computation skew, and further optimization is needed if single-query performance matters. Checking the definition of DMISC.DM_DIM_CBG_WH_HOLIDAY_D reveals that it is a replicated table (distribute by replication), so each stage of the computation uses only one of the DNs. And in this SQL, wherever the table appears, the join key is always WH_ID.
Next, check how skewed the table would be if its distribution column were changed to WH_ID:
select * from pg_catalog.table_skewness('DMISC.DM_DIM_CBG_WH_HOLIDAY_D', 'wh_id');
The result has 23 rows, fewer than the number of DNs in the cluster, and skew is present. But because this SQL consumes the table's full data set, the table can still be redistributed using WH_ID as the distribution key.
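The idea behind such a skew check can be sketched in a few lines. The snippet below is a simplified illustration, not GaussDB's actual distribution hash: it buckets rows by a hash of wh_id across a pretend cluster of 8 DNs and counts rows per DN, which is the kind of per-DN row count that table_skewness reports.

```python
# Simplified sketch of a hash-distribution skew check (the hash function and
# DN count are invented for illustration; GaussDB's real hash differs).
import zlib
from collections import Counter

num_dn = 8                                           # pretend cluster size
wh_ids = ["WH%02d" % (i % 23) for i in range(1000)]  # only 23 distinct keys

# Each row lands on the DN chosen by hash(distribution key) % num_dn.
per_dn = Counter(zlib.crc32(w.encode()) % num_dn for w in wh_ids)

# Whole keys land on single DNs; a large spread between the per-DN counts
# is the skew that table_skewness measures.
print(sorted(per_dn.values(), reverse=True))
```

With few distinct keys, every row of a key lands on one DN, so some DNs necessarily carry more rows than others; that spread is acceptable here because it still beats running everything on a single DN.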
Computation skew caused by a replicated table cannot be fixed with a skew hint; it can be resolved by changing the physical table's distribution or by creating a temporary table (replicated tables are usually small). Given how the table is used in the SQL and its skew profile, changing the physical table's distribution key to WH_ID is not appropriate, so this case tries a temporary table with an explicitly specified distribution.
DROP TABLE IF EXISTS holiday_d_tmp;

CREATE TEMP TABLE holiday_d_tmp
WITH (orientation = COLUMN, compression = low)
DISTRIBUTE BY HASH (wh_id)
AS (
    SELECT * FROM DMISC.DM_DIM_CBG_WH_HOLIDAY_D
);

EXPLAIN performance
WITH TMP AS (
    SELECT WH_ID
         , (IFNULL(SUBSTR(THE_DATE, 1, 10), '1900-01-01') || ' ' || STOP_TIME)::TIMESTAMP AS STOP_TIME
         , (IFNULL(SUBSTR(THE_DATE, 1, 10), '1900-01-01') || ' ' || '23:59:59')::TIMESTAMP AS MAX_ASD
      FROM holiday_d_tmp
     WHERE IS_OPEN = 'Y'
       AND STOP_TIME IS NOT NULL
)
SELECT T1.WH_ID
     , T1.THE_DATE
     , T1.IS_OPEN
     , MIN(CASE WHEN T1.THE_DATE < T2.STOP_TIME THEN STOP_TIME ELSE NULL END) AS STOP_TIME
     , MIN(CASE WHEN T1.THE_DATE < T2.STOP_TIME THEN T2.MAX_ASD ELSE NULL END) AS TODAY_MAX_ASD
     , MIN(CASE WHEN ADDDATE(T1.THE_DATE, 1) < T2.STOP_TIME THEN T2.MAX_ASD ELSE NULL END) AS NEXT_MAX_ASD
  FROM (SELECT WH_ID, THE_DATE, IS_OPEN FROM holiday_d_tmp) T1
  LEFT JOIN TMP T2
    ON T1.WH_ID = T2.WH_ID
 GROUP BY T1.WH_ID
        , T1.THE_DATE
        , T1.IS_OPEN;
Below is the corresponding execution plan:
 id |                                operation                                 |      A-time      |  A-rows  |  E-rows  | E-distinct |  Peak Memory   | E-memory | A-width | E-width | E-costs
----+--------------------------------------------------------------------------+------------------+----------+----------+------------+----------------+----------+---------+---------+----------
  1 | ->  Row Adapter                                                          | 673.495          |    34035 |    34032 |            | 70KB           |          |         |      58 | 68112.60
  2 |    ->  Vector Streaming (type: GATHER)                                   | 671.103          |    34035 |    34032 |            | 216KB          |          |         |      58 | 68112.60
  3 |       ->  Vector Hash Aggregate                                          | [0.079, 672.724] |    34035 |    34032 |            | [1MB, 1MB]     | 16MB     | [0,114] |      58 | 67794.10
  4 |          ->  Vector Hash Left Join (5, 6)                                | [0.047, 76.395]  | 31205167 | 27587201 |            | [324KB, 485KB] | 16MB     |         |      34 | 8876.88
  5 |             ->  CStore Scan on pg_temp_cn_5003_6_22022_139764371019520.holiday_d_tmp | [0.004, 0.098] | 34036 | 34036 | 1 | [760KB, 792KB] | 1MB |         |      18 | 1553.65
  6 |             ->  CStore Scan on pg_temp_cn_5003_6_22022_139764371019520.holiday_d_tmp | [0.008, 3.253] | 25122 | 22018 | 1 | [880KB, 1MB]   | 1MB | [0,61]  |      24 | 1557.76
The plan shows that this runs considerably faster than the single-DN computation, although the roughly 0.2 s needed to create the temporary table is not counted here.
5. [Tuning Summary]
In this case, because the SQL ran far too long, we looked at the verbose plan rather than the performance plan first. After spotting the fan-out of the intermediate result set, we rewrote the SQL into an equivalent form, replacing the two (equality + inequality) joins with a single equality join plus conditional aggregation. We then found that the replicated table caused the SQL to suffer computation skew; considering how the SQL consumes the table's data and the table's statistics, we created a temporary table with a newly specified distribution, which resolved the skew. The SQL went from over 25 minutes on a single DN to about 800 ms.