This article is shared from the Huawei Cloud community post "[MySQL Technical Column] Introduction to MySQL 8.0 Histograms", by GaussDB Database.
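Background
The query optimizer is responsible for turning a SQL query into the most efficient execution plan it can find. Because the data keeps changing, however, the optimizer may not know enough about the data being queried and may fail to produce the best plan, which hurts query efficiency. MySQL 8.0 introduced histograms to address this problem.
A histogram records how the values of a column are distributed and supplies that information to the optimizer as statistics. With a histogram, the distribution of one column of a table can be summarized and used to estimate the selectivity of the filter columns in a WHERE clause, which helps the optimizer estimate row counts more accurately and choose a more efficient query plan.
This article introduces the histogram concept, shows by example how histograms are used, briefly analyzes how histograms are created and dropped, and illustrates where they apply.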
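Introduction to MySQL 8.0 Histograms
In a database, the quality of the execution plan produced by the query optimizer determines how long a query takes. If the optimizer does not know how the data in a table is distributed, it may fail to produce the best plan and waste time during execution.
Suppose a SQL statement counts the travelers in two time periods of equal length. If the optimizer does not know how many people traveled in each period, it assumes the travelers are evenly distributed across both. If the real counts differ greatly, the optimizer's estimates are badly off and it may choose the wrong execution plan. So how can the optimizer learn enough about the data distribution to produce a good plan?
One solution is to build a histogram on the column and thereby obtain an approximation of its data distribution. Used well, histograms bring several benefits:
(1) Query optimization: they provide statistics about the data distribution, which helps optimize query plans, choose suitable indexes, and tune query statements, improving query performance;
(2) Index design: analyzing the data distribution helps determine which columns are good candidates for an index, improving query efficiency;
(3) Data analysis: they expose the data distribution, helping users understand the characteristics and trends of the data.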
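The traveler example above can be made concrete with a small sketch. The departure hours below are hypothetical, invented only for illustration, and the two estimator functions are simplified models of "assume a uniform distribution" versus "consult per-value counts"; they are not MySQL's actual estimation code.

```python
from collections import Counter

# Hypothetical departure hours for 1,000 travelers, heavily skewed
# towards the morning (made-up numbers, for illustration only).
departure_hours = [8] * 700 + [9] * 200 + [20] * 60 + [21] * 40


def uniform_estimate(values, low, high):
    """Estimate rows in [low, high], assuming values are spread evenly
    between the column minimum and maximum."""
    span = max(values) - min(values) + 1
    return len(values) * (high - low + 1) / span


def histogram_estimate(values, low, high):
    """Estimate rows in [low, high] from per-value counts, which is the
    information a histogram approximates."""
    counts = Counter(values)
    return sum(n for v, n in counts.items() if low <= v <= high)


morning_uniform = uniform_estimate(departure_hours, 8, 9)
evening_uniform = uniform_estimate(departure_hours, 20, 21)
morning_hist = histogram_estimate(departure_hours, 8, 9)
evening_hist = histogram_estimate(departure_hours, 20, 21)

print(morning_uniform, evening_uniform)  # identical guesses for both periods
print(morning_hist, evening_hist)        # counts that reflect the skew
```

Under the uniform assumption both two-hour periods receive the same estimate, while the per-value counts show that one period holds nine times as many rows as the other.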
Histograms come in two types: singleton and equi-height. In a singleton histogram, each bucket stores a single value and the cumulative frequency of that value:
SCHEMA_NAME: xxx // database name
TABLE_NAME: xxx // table name
COLUMN_NAME: xxx // column name
HISTOGRAM: {
"buckets":[
[
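xxx, // value stored in the bucket
xxx // cumulative frequency of the value
],
......
],
"data-type":"xxx", // data type
"null-values":xxx, // fraction of values that are NULL
"collation-id":xxx,
"last-updated":"xxxx-xx-xx xx:xx:xx.xxxxxx", // time of last update
"sampling-rate":xxx, // sampling rate; 1 means all rows were read
"histogram-type":"singleton", // histogram type: singleton
"number-of-buckets-specified":xxx // number of buckets requested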
}
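As a rough illustration of the singleton layout above, the following sketch turns a list of column values into `[value, cumulative frequency]` buckets. The function name `build_singleton` and the input data are hypothetical, and the sketch assumes that frequencies are computed over all rows, NULLs included, with NULLs reported separately as a fraction.

```python
from collections import Counter


def build_singleton(values):
    """Return (buckets, null_fraction) for a list of column values.

    Each bucket is [value, cumulative_frequency]. None stands in for SQL
    NULL and is kept out of the buckets (an assumption of this sketch).
    """
    total = len(values)
    if total == 0:
        return [], 0.0
    counts = Counter(v for v in values if v is not None)
    buckets = []
    running = 0  # rows covered by the buckets emitted so far
    for value in sorted(counts):
        running += counts[value]
        buckets.append([value, running / total])
    null_fraction = (total - running) / total
    return buckets, null_fraction


buckets, null_fraction = build_singleton([10, 10, 20, 30, None])
print(buckets)
print(null_fraction)
```

Dividing the running row count by the total at each step, instead of adding up per-value fractions, keeps floating-point error from accumulating across buckets.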
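In an equi-height histogram, each bucket stores its lower and upper bounds, the cumulative frequency, and the number of distinct values it contains: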
SCHEMA_NAME: xxx
TABLE_NAME: xxx
COLUMN_NAME: xxx
HISTOGRAM: {
"buckets":[
[
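xxx, // lower bound (minimum value in the bucket)
xxx, // upper bound (maximum value in the bucket)
xxx, // cumulative frequency up to and including this bucket
xxx // number of distinct values in the bucket
],
......
],
"data-type":"xxx",
"null-values":xxx,
"collation-id":xxx,
"last-updated":"xxxx-xx-xx xx:xx:xx.xxxxxx",
"sampling-rate":xxx,
"histogram-type":"equi-height", // histogram type: equi-height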
"number-of-buckets-specified":xxx
}
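To show how the four bucket fields of an equi-height histogram relate to the data, here is a deliberately simplified sketch. The function `build_equi_height` and its greedy bucket-closing rule are assumptions made for illustration; the real logic in Equi_height<T>::build_histogram differs in its details.

```python
from collections import Counter


def build_equi_height(values, num_buckets):
    """Greedy sketch: close a bucket once the cumulative row count reaches
    that bucket's share of the total. A value is never split across buckets.

    Each bucket is [lower, upper, cumulative_frequency, distinct_values].
    """
    counts = sorted(Counter(values).items())
    total = len(values)
    target = total / num_buckets  # ideal number of rows per bucket
    buckets = []
    running = 0    # rows consumed so far
    lower = None   # lower bound of the bucket being filled
    distinct = 0   # distinct values in the bucket being filled
    for i, (value, n) in enumerate(counts):
        if lower is None:
            lower = value
        running += n
        distinct += 1
        is_last = i == len(counts) - 1
        if is_last or running >= target * (len(buckets) + 1):
            buckets.append([lower, value, running / total, distinct])
            lower, distinct = None, 0
    return buckets


even = build_equi_height(list(range(1, 11)), 2)
skewed = build_equi_height([1] * 6 + [2, 3, 4, 5], 2)
print(even)
print(skewed)
```

With evenly spread values the two buckets cover equal ranges; with skewed data the frequent value gets a bucket to itself and the remaining values share the other one.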
Using MySQL 8.0 Histograms
Histograms are created and dropped with the ANALYZE TABLE statement. The common syntax is as follows.
Create a histogram:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] ... [WITH N BUCKETS]
Drop a histogram:
ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name] ...
A concrete example:
mysql> create table t1(c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int,c10 int,c11 int,c12 int,c13 datetime,c14 int,c15 int,c16 int,primary key(c1));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t1 values(1,2,3,4,5,6,7,8,9,10,11,12,'0000-01-01',14,15,16),(2,2,3,4,5,6,7,8,9,10,11,12,'0500-01-01',14,15,16),(3,2,3,4,5,6,7,8,9,10,11,12,'1000-01-01',14,15,16),(4,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16),(5,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16);
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
Create the histogram:
mysql> analyze table t1 update histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status | Histogram statistics created for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.01 sec)
View the histogram information:
mysql> select json_pretty(histogram)result from information_schema.column_statistics where table_name = 't1' and column_name = 'c13'\G
*************************** 1. row ***************************
result: {
"buckets": [
[
"0000-01-01 00:00:00.000000", // column value
0.2 // cumulative relative frequency (likewise below)
],
[
"0500-01-01 00:00:00.000000",
0.4
],
[
"1000-01-01 00:00:00.000000",
0.6
],
[
"1500-01-01 00:00:00.000000",
1.0
]
],
"data-type": "datetime", // data type of the column
"null-values": 0.0, // fraction of NULL values
"collation-id": 8, // collation ID of the histogram data
"last-updated": "2023-09-30 16:05:28.533732", // when the histogram was last updated
"sampling-rate": 1.0, // sampling rate used to build the histogram
"histogram-type": "singleton", // histogram type: singleton
"number-of-buckets-specified": 100 // number of buckets requested
}
1 row in set (0.00 sec)
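Because the frequencies in the output above are cumulative, the share of rows for a single value is the difference between its bucket and the previous one. The sketch below reads the buckets from that output and derives selectivities for an equality and a less-than predicate. The helper names are hypothetical, and this is only a model of how such statistics can be consumed, not the optimizer's own code.

```python
# Buckets copied from the output above: [value, cumulative frequency].
buckets = [
    ["0000-01-01 00:00:00.000000", 0.2],
    ["0500-01-01 00:00:00.000000", 0.4],
    ["1000-01-01 00:00:00.000000", 0.6],
    ["1500-01-01 00:00:00.000000", 1.0],
]


def equal_selectivity(buckets, value):
    """Fraction of rows equal to value: this bucket's cumulative
    frequency minus the previous bucket's."""
    previous = 0.0
    for bucket_value, cumulative in buckets:
        if bucket_value == value:
            return cumulative - previous
        previous = cumulative
    return 0.0  # value not present in the histogram


def less_than_selectivity(buckets, value):
    """Fraction of rows strictly below value: the cumulative frequency of
    the last bucket whose value is smaller."""
    selectivity = 0.0
    for bucket_value, cumulative in buckets:
        if bucket_value < value:
            selectivity = cumulative
    return selectivity


eq = equal_selectivity(buckets, "1500-01-01 00:00:00.000000")
lt = less_than_selectivity(buckets, "1000-01-01 00:00:00.000000")
print(eq, lt)
```

Two of the five inserted rows carry the value 1500-01-01, which is why the last bucket jumps from 0.6 to 1.0. The fixed-width datetime strings compare correctly as plain strings, which keeps the sketch free of date parsing.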
Drop the histogram:
mysql> analyze table t1 drop histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status | Histogram statistics removed for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.00 sec)
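A Brief Look at How MySQL 8.0 Histograms Work
The overall framework of the histogram implementation can be summarized as in the figure below:
The histogram code lives mainly under sql/histograms. Files prefixed with equi_height implement equi-height histograms, files prefixed with singleton implement singleton histograms, files prefixed with value_map implement the structure that holds the collected values, and histogram.h/histogram.cc contain the interfaces used to call into the histogram code.
Sql_cmd_analyze_table::handle_histogram_command is the overall entry point for histogram operations; at present, histogram operations are supported on only one table at a time. The main call stack for creating a histogram is shown below, with update_histogram as the entry point for creation.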
mysql_execute_command
->Sql_cmd_analyze_table::execute
->Sql_cmd_analyze_table::handle_histogram_command
->Sql_cmd_analyze_table::update_histogram
->histograms::update_histogram
->prepare_value_maps
->fill_value_maps
->build_histogram
->store_histogram
->dd::cache::Dictionary_client::update
->dd::cache::Storage_adapter::store
->dd::Column_statistics_impl::store_attributes
->histograms::Singleton<xxx>::histogram_to_json
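In more detail, prepare_value_maps creates a value_map that matches the type of each histogram column. The sampling rate is then derived from histogram_generation_max_mem_size (the maximum amount of memory that histogram generation may use) and the size of a single row. fill_value_maps repeatedly reads rows and fills the value_map of the matching type, where the key is the actual column value and the value is the number of times it occurs. build_histogram then constructs the histogram: if the number of buckets (num_buckets) is at least the number of distinct values (value_map.size()), a singleton histogram is created automatically; otherwise an equi-height histogram is created. The construction logic for the two types lives in Singleton<T>::build_histogram and Equi_height<T>::build_histogram respectively.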
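The two decisions described above, how much of the table to sample and which histogram type to build, can be sketched as follows. The function names, the exact sampling formula, and the example figures are assumptions made for illustration, including the assumption that an equal number of buckets and distinct values still yields a singleton histogram; the sketch is not a transcription of the MySQL source.

```python
from collections import Counter


def sampling_rate(max_mem_bytes, row_size_bytes, row_count):
    """Fraction of rows that fit in the memory budget, capped at 1.0."""
    if row_count == 0 or row_size_bytes == 0:
        return 1.0
    return min(1.0, max_mem_bytes / (row_size_bytes * row_count))


def fill_value_map(column_values):
    """Map each distinct value to the number of times it occurs."""
    return Counter(column_values)


def choose_histogram_type(num_buckets, value_map):
    """Singleton when every distinct value can have its own bucket."""
    return "singleton" if num_buckets >= len(value_map) else "equi-height"


# Example budget of 20,000,000 bytes; the row sizes are made up.
big_table_rate = sampling_rate(20_000_000, 100, 1_000_000)
small_table_rate = sampling_rate(20_000_000, 8, 5)

value_map = fill_value_map([2, 2, 2, 3, 4, 4])
few_values = choose_histogram_type(100, value_map)
many_values = choose_histogram_type(2, value_map)
print(big_table_rate, small_table_rate, few_values, many_values)
```

A small table such as t1 in the earlier example fits entirely in the budget, which is consistent with the "sampling-rate": 1.0 shown in its histogram.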
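Once the histogram is built, store_histogram saves the result as JSON in a system table, where it is exposed to users through INFORMATION_SCHEMA.COLUMN_STATISTICS. histogram_to_json converts the histogram into a Json_object: for example, last-updated is stored as a Json_datetime, histogram-type as a Json_string, and sampling-rate as a Json_double. Each JSON field is then saved in turn by calling json_object->add_clone.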
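The field-by-field assembly described above can be mirrored in a few lines. The function `histogram_to_json_sketch` is hypothetical and uses Python's standard json module in place of MySQL's Json_object classes; only the key names are taken from the histogram output shown earlier.

```python
import json


def histogram_to_json_sketch(buckets, data_type, null_fraction, collation_id,
                             last_updated, sampling_rate, histogram_type,
                             buckets_specified):
    """Assemble the histogram fields into one JSON document, one key at a
    time, in the order they appear in COLUMN_STATISTICS output."""
    doc = {}
    doc["buckets"] = buckets
    doc["data-type"] = data_type
    doc["null-values"] = null_fraction
    doc["collation-id"] = collation_id
    doc["last-updated"] = last_updated
    doc["sampling-rate"] = sampling_rate
    doc["histogram-type"] = histogram_type
    doc["number-of-buckets-specified"] = buckets_specified
    return json.dumps(doc, indent=2)


text = histogram_to_json_sketch(
    buckets=[[2, 1.0]],
    data_type="int",
    null_fraction=0.0,
    collation_id=8,
    last_updated="2023-09-30 16:05:28.533732",
    sampling_rate=1.0,
    histogram_type="singleton",
    buckets_specified=100,
)
parsed = json.loads(text)  # round-trip to confirm the document is valid JSON
print(text)
```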
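The main call stack for dropping a histogram is shown below. Before deleting anything, the drop_histograms logic first tries to fetch the histogram to check that it really exists; if it does not, the logic ends early, and if it does, the histogram is deleted.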
mysql_execute_command
->Sql_cmd_analyze_table::execute
->Sql_cmd_analyze_table::handle_histogram_command
->Sql_cmd_analyze_table::drop_histogram
->histograms::drop_histograms
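MySQL 8.0 Histogram Optimization Scenarios
On the optimization side, as described earlier, histogram information is used to estimate the selectivity of each predicate in a WHERE clause, which helps the optimizer choose the best execution plan. For example, suppose a table contains skewed data like the following.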
mysql> select sys_id,order_status,count(*) from my_table_1 group by sys_id,order_status order by 1,2,3;
+--------+--------------+----------+
| sys_id | order_status | count(*) |
+--------+--------------+----------+
| 3 | 1 | 1 |
| 3 | 2 | 200766 |
| 3 | 3 | 3353 |
| 3 | 4 | 1325 |
| 5 | 1 | 13 |
| 5 | 2 | 2478373 |
| 5 | 3 | 43243 |
| 5 | 4 | 13529 |
| 6 | 2 | 171388 |
| 6 | 3 | 254 |
| 6 | 4 | 716 |
+--------+--------------+----------+
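To quantify the skew in the result above, the sketch below adds up the counts per order_status and converts them into fractions of the table. The rows are copied from the query output; the comparison against a flat one-quarter guess is only an illustration of a distribution-blind estimate and is not a claim about the default MySQL actually uses.

```python
# (sys_id, order_status, count) rows copied from the query result above.
rows = [
    (3, 1, 1), (3, 2, 200766), (3, 3, 3353), (3, 4, 1325),
    (5, 1, 13), (5, 2, 2478373), (5, 3, 43243), (5, 4, 13529),
    (6, 2, 171388), (6, 3, 254), (6, 4, 716),
]

total = sum(n for _, _, n in rows)

# Row count per order_status across all sys_id values.
status_counts = {}
for _, status, n in rows:
    status_counts[status] = status_counts.get(status, 0) + n

# Fraction of the table held by each status.
status_fraction = {s: n / total for s, n in status_counts.items()}

flat_guess = 1 / len(status_counts)  # illustrative "evenly spread" guess

for status in sorted(status_fraction):
    print(status, status_counts[status], f"{status_fraction[status]:.6f}")
```

Status 2 accounts for almost the whole table while status 1 matches only a handful of rows, so any estimate that treats the four statuses alike is wrong by orders of magnitude for both.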
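When the following SQL statement is executed, the optimizer cannot estimate accurately because of the data skew and picks the wrong execution plan; execution takes about 1.35 s.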
mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=4163.10 rows=20) (actual time=1350.825..1350.825 rows=0 loops=1)
-> Nested loop left join (cost=4163.10 rows=49) (actual time=1350.825..1350.825 rows=0 loops=1)
-> Filter: ((t1.order_status = 1) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=215.79 rows=49) (actual time=1350.823..1350.823 rows=0 loops=1)
-> Index scan on t1 using PRIMARY (reverse) (cost=215.79 rows=8828) (actual time=0.088..1209.201 rows=2910194 loops=1)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.63 rows=1) (never executed)
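After histograms are created with ANALYZE TABLE my_table_1 UPDATE HISTOGRAM ON order_status, sys_id, create_time and the same SQL statement is run again, the index used in the execution plan changes and execution takes about 0.11 s. With more accurate information about the data distribution, the optimizer has chosen a better plan.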
mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=38385.46 rows=20) (actual time=114.217..114.217 rows=0 loops=1)
-> Nested loop left join (cost=38385.46 rows=62764) (actual time=114.216..114.216 rows=0 loops=1)
-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=28200.86 rows=62668) (actual time=114.215..114.215 rows=0 loops=1)
-> Filter: (t1.order_status = 1) (cost=28200.86 rows=62668) (actual time=114.207..114.207 rows=0 loops=1)
-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=28200.86 rows=62668) (actual time=0.326..112.912 rows=31142 loops=1)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.62 rows=1) (never executed)
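In addition, when the values used in the WHERE clause differ, the optimizer picks a plan suited to each case based on the data distribution, which improves execution efficiency.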
mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (2) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=13541.27 rows=20)
-> Nested loop left join (cost=13541.27 rows=44)
-> Filter: ((t1.order_status = 2) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=15.79 rows=44)
-> Index scan on t1 using PRIMARY (reverse) (cost=15.79 rows=338)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)
mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (4) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=30559.31 rows=20)
-> Nested loop left join (cost=30559.31 rows=55852)
-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=24966.26 rows=55480)
-> Filter: (t1.order_status = 4) (cost=24966.26 rows=55480)
-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=24966.26 rows=55480)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)
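Helping the optimizer pick better query plans through the statistics a histogram provides, and thereby improving query performance, is thus one of the benefits of histograms described earlier.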