Author: LuHengXing
Link: http://www.dbapub.cn/2020/09/01/MySQL8.0直方圖/
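The query optimizer is responsible for turning a SQL query into the most efficient execution plan it can find. As the data changes over time, however, the optimizer may fail to find the best plan, and the query runs slowly. This happens because the optimizer does not know enough about the data being queried: for example, how many rows each table holds, how many distinct values each column contains, and how the values in each column are distributed.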
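To address this, MySQL 8.0.3 introduced histograms. A histogram is an approximation of the data distribution of a column, and it gives the optimizer additional statistics such as how many values in the column are NULL, the share of each distinct value, and the minimum and maximum values. MySQL has two kinds of histogram, singleton and equi-height. MySQL chooses the type automatically, and the choice cannot be overridden.
- Singleton histogram: each bucket stores one value and the cumulative frequency of that value
- Equi-height histogram: each bucket stores its lower and upper bounds, the cumulative frequency, and the number of distinct values in the bucket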
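As a rough illustration of the two bucket layouts listed above, the following Python sketch builds both kinds from a list of values. The function name `build_histogram` is made up for this example. The rule used to pick the type (singleton when the distinct values fit in the requested bucket count, equi-height otherwise) and the way rows are split into buckets are simplifying assumptions, not MySQL's actual implementation.

```python
from collections import Counter


def build_histogram(values, num_buckets):
    """Return (histogram_type, buckets) for a list of non-NULL values.

    Hypothetical helper for illustration only.
    """
    counts = sorted(Counter(values).items())  # (value, row count), ordered by value
    total = len(values)

    if len(counts) <= num_buckets:
        # Singleton: one bucket per distinct value -> [value, cumulative frequency]
        buckets, seen = [], 0
        for value, n in counts:
            seen += n
            buckets.append([value, seen / total])
        return "singleton", buckets

    # Equi-height: close a bucket once the running row count reaches the
    # next multiple of total / num_buckets (assumed splitting rule).
    buckets, seen = [], 0
    lower, distinct = None, 0
    for i, (value, n) in enumerate(counts):
        if lower is None:
            lower = value
        seen += n
        distinct += 1
        target = total * (len(buckets) + 1) / num_buckets
        is_last = i == len(counts) - 1
        if seen >= target or is_last:
            # [lower bound, upper bound, cumulative frequency, distinct values]
            buckets.append([lower, value, seen / total, distinct])
            lower, distinct = None, 0
    return "equi-height", buckets


print(build_histogram(["a"] * 5 + ["b"] * 3 + ["c"] * 2, 4))
print(build_histogram(list(range(1, 9)), 2))
```

The first call has three distinct values and four buckets available, so every value gets its own bucket. The second call has eight distinct values but only two buckets, so the values are grouped into ranges.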
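Histograms also come with some restrictions:
- Columns of geometry types and JSON columns are not supported
- Encrypted tables and temporary tables are not supported
- A histogram cannot be generated for a column covered by a single-column unique index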
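Creating and dropping histograms
Creation syntax
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] [WITH N BUCKETS];
A single statement can create histograms for several columns at once. The WITH N BUCKETS clause is optional: N must be between 1 and 1024, and it defaults to 100 when the clause is omitted. When choosing the bucket count, consider how many distinct values the column has, how skewed the data is, and how much accuracy you need. It is best to start with a low value and increase it step by step if the estimates are not good enough.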
Drop syntax
ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
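When these statements are generated from a script, the bucket range can be checked before anything is sent to the server. The sketch below only builds SQL strings. The helper names `quote_ident`, `update_histogram_sql` and `drop_histogram_sql` are invented for this example, and the backtick quoting is a minimal assumption rather than a complete identifier-escaping routine.

```python
def quote_ident(name):
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def update_histogram_sql(table, columns, buckets=100):
    """Build an ANALYZE TABLE ... UPDATE HISTOGRAM statement (hypothetical helper)."""
    if not columns:
        raise ValueError("at least one column is required")
    if not 1 <= buckets <= 1024:
        raise ValueError("bucket count must be between 1 and 1024")
    cols = ", ".join(quote_ident(c) for c in columns)
    return (
        f"ANALYZE TABLE {quote_ident(table)} "
        f"UPDATE HISTOGRAM ON {cols} WITH {buckets} BUCKETS;"
    )


def drop_histogram_sql(table, columns):
    """Build an ANALYZE TABLE ... DROP HISTOGRAM statement (hypothetical helper)."""
    if not columns:
        raise ValueError("at least one column is required")
    cols = ", ".join(quote_ident(c) for c in columns)
    return f"ANALYZE TABLE {quote_ident(table)} DROP HISTOGRAM ON {cols};"


print(update_histogram_sql("employees_like", ["birth_date", "first_name"], 64))
print(drop_histogram_sql("employees_like", ["birth_date"]))
```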
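Histogram information
MySQL stores histogram definitions in the data dictionary table column_statistics, which can be queried through INFORMATION_SCHEMA.COLUMN_STATISTICS. Each row holds the histogram of one column, stored in JSON format.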
root@employees 13:49: select json_pretty(histogram) from information_schema.column_statistics where table_name='employees' and column_name='first_name';
{
"buckets": [
[
"base64:type254:QWFtZXI=",
"base64:type254:QWRlbA==",
0.010176045588684237,
13
],
"data-type": "string",
"null-values": 0.0,
"collation-id": 255,
"last-updated": "2020-09-09 05:47:32.548874",
"sampling-rate": 0.163495700259278,
"histogram-type": "equi-height",
"number-of-buckets-specified": 100
}
MySQL chose an equi-height histogram for the first_name column of employees, with the default of 100 buckets.
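The string bounds in the bucket shown above are base64-encoded. A small Python sketch can turn them back into readable text. The helper name `decode_bucket_value` is made up for this example, and decoding the bytes as UTF-8 is an assumption that fits this column but may not hold for every character set.

```python
import base64


def decode_bucket_value(encoded):
    """Decode a 'base64:typeNNN:<payload>' string taken from a histogram bucket."""
    prefix, type_tag, payload = encoded.split(":", 2)
    if prefix != "base64" or not type_tag.startswith("type"):
        raise ValueError(f"unexpected bucket value format: {encoded!r}")
    # Assumption: the column holds UTF-8 text.
    return base64.b64decode(payload).decode("utf-8")


# First bucket from the JSON output:
# [lower bound, upper bound, cumulative frequency, distinct values]
bucket = ["base64:type254:QWFtZXI=", "base64:type254:QWRlbA==", 0.010176045588684237, 13]
lower = decode_bucket_value(bucket[0])
upper = decode_bucket_value(bucket[1])
print(lower, upper, bucket[2], bucket[3])  # Aamer Adel 0.010176045588684237 13
```

Read this way, the first bucket covers first names from Aamer to Adel, holds about 1% of the sampled rows, and contains 13 distinct values.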
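When a histogram is generated, MySQL loads the data into memory and does all the work in memory. On a large table this carries the risk of reading hundreds of megabytes into memory, so the histogram_generation_max_mem_size parameter limits how much memory histogram generation may use. When that limit is too small to hold the whole data set, MySQL falls back to sampling.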
root@employees 14:12: select histogram->>'$."sampling-rate"' from information_schema.column_statistics where table_name='employees' and column_name='first_name';
+---------------------------------+
| histogram->>'$."sampling-rate"' |
+---------------------------------+
| 0.163495700259278 |
+---------------------------------+
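To get a feel for how the memory limit and the sampling rate relate, the sketch below uses a deliberately simple model: the sampling rate is the fraction of the data that fits into the memory limit, capped at 1. This model, the function name `estimate_sampling_rate`, and the per-row byte figure are all assumptions made for illustration. The server's real calculation is internal and will not match these numbers exactly.

```python
def estimate_sampling_rate(row_count, avg_row_bytes, max_mem_bytes):
    """Rough estimate: fraction of rows that fits in the memory limit.

    Hypothetical model, not the server's actual formula.
    """
    if row_count <= 0 or avg_row_bytes <= 0:
        raise ValueError("row_count and avg_row_bytes must be positive")
    needed = row_count * avg_row_bytes  # memory needed to hold every row
    return min(1.0, max_mem_bytes / needed)


# 1,000,000 rows at an assumed 100 bytes each against a 20,000,000-byte limit
print(estimate_sampling_rate(1_000_000, 100, 20_000_000))  # 0.2
# A small table fits entirely, so no sampling is needed
print(estimate_sampling_rate(1_000, 100, 20_000_000))      # 1.0
```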
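Starting with MySQL 8.0.19, the InnoDB storage engine provides its own sampling implementation for the data stored in its tables. When a storage engine does not provide one, MySQL uses its default sampling implementation, which requires a full table scan and is too costly for large tables. The storage engine implementation avoids the full table scan and so improves sampling performance.
Sampling of data pages can be monitored through the INNODB_METRICS counters, which must be enabled beforehand (for example with SET GLOBAL innodb_monitor_enable = 'sampled%';).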
root@employees 14:26: SELECT NAME, COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME LIKE 'sampled%'\G
*************************** 1. row ***************************
NAME: sampled_pages_read
COUNT: 430
*************************** 2. row ***************************
NAME: sampled_pages_skipped
COUNT: 456
2 rows in set (0.04 sec)
The sampling rate is calculated as: sampled_pages_read / (sampled_pages_read + sampled_pages_skipped)
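Applying this formula to the counter values shown above is plain arithmetic. The short Python sketch below does it. The function name `page_sampling_rate` is made up for this example.

```python
def page_sampling_rate(pages_read, pages_skipped):
    """sampled_pages_read / (sampled_pages_read + sampled_pages_skipped)"""
    total = pages_read + pages_skipped
    if total == 0:
        raise ValueError("no pages were read or skipped")
    return pages_read / total


# Counter values from the INNODB_METRICS output
rate = page_sampling_rate(430, 456)
print(f"{rate:.4f}")  # 0.4853
```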
Optimization example
Make a copy of a table. The source table gets no histogram, and the new table gets histograms.
root@employees 14:32: create table employees_like like employees;
Query OK, 0 rows affected (0.03 sec)
root@employees 14:33: insert into employees_like select * from employees;
Query OK, 300024 rows affected (3.59 sec)
Records: 300024 Duplicates: 0 Warnings: 0
root@employees 14:33: ANALYZE TABLE employees_like update HISTOGRAM on birth_date,first_name;
+--------------------------+-----------+----------+-------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+--------------------------+-----------+----------+-------------------------------------------------------+
| employees.employees_like | histogram | status | Histogram statistics created for column 'birth_date'. |
| employees.employees_like | histogram | status | Histogram statistics created for column 'first_name'. |
+--------------------------+-----------+----------+-------------------------------------------------------+
Check the execution plan of the same SQL on each of the two tables:
root@employees 14:43: explain format=json select count(*) from employees where (birth_date between '1953-05-01' and '1954-05-01') and first_name like 'A%';
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "30214.45"
},
"table": {
"table_name": "employees",
"access_type": "ALL",
"rows_examined_per_scan": 299822,
"rows_produced_per_join": 3700,
"filtered": "1.23",
"cost_info": {
"read_cost": "29844.37",
"eval_cost": "370.08",
"prefix_cost": "30214.45",
"data_read_per_join": "520K"
},
"used_columns": [
"birth_date",
"first_name"
],
"attached_condition": "((`employees`.`employees`.`birth_date` between '1953-05-01' and '1954-05-01') and (`employees`.`employees`.`first_name` like 'A%'))"
}
}
}
root@employees 14:45: explain format=json select count(*) from employees where (birth_date between '1953-05-01' and '1954-05-01') and first_name like 'A%';
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "18744.56"
},
"table": {
"table_name": "employees",
"access_type": "range",
"possible_keys": [
"idx_birth",
"idx_first"
],
"key": "idx_first",
"used_key_parts": [
"first_name"
],
"key_length": "58",
"rows_examined_per_scan": 41654,
"rows_produced_per_join": 6221,
"filtered": "14.94",
"index_condition": "(`employees`.`employees`.`first_name` like 'A%')",
"cost_info": {
"read_cost": "18122.38",
"eval_cost": "622.18",
"prefix_cost": "18744.56",
"data_read_per_join": "874K"
},
"used_columns": [
"birth_date",
"first_name"
],
"attached_condition": "(`employees`.`employees`.`birth_date` between '1953-05-01' and '1954-05-01')"
}
}
}
The estimated cost drops from 30214.45 to 18744.56 and the number of rows examined drops from 299822 to 41654, so performance improves.
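The size of that improvement can be put in relative terms with a few lines of Python. The figures are copied from the two EXPLAIN outputs, and the helper name `relative_drop` is made up for this example. Keep in mind that these are optimizer estimates, not measured run times.

```python
def relative_drop(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


cost_drop = relative_drop(30214.45, 18744.56)  # query_cost
rows_drop = relative_drop(299822, 41654)       # rows_examined_per_scan
print(f"cost: -{cost_drop:.1%}, rows examined: -{rows_drop:.1%}")
```

Under these numbers the estimated cost falls by roughly 38% and the estimated rows examined by roughly 86%.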
References:
https://dev.mysql.com/doc/refman/8.0/en/analyze-table.html#analyze-table-histogram-statistics-analysis
https://mysqlserverteam.com/histogram-statistics-in-mysql/