Is It Always Faster for the Small Table to Drive the Large Table in an NL Join?

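Original content from the GreatSQL community. Do not use it without authorization; to repost, please contact the editor and credit the source.
GreatSQL is a Chinese-developed branch of MySQL and is used in the same way as MySQL.
Author: JennyYu
Source: GreatSQL community original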

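Preface

When two tables are joined with a nested loop (NL below), it is widely taken for granted that having the small table drive the large table is more efficient. In fact this holds only under certain conditions. It depends mainly on how many rows come back when the large table is probed through the index on the join column, and on whether those rows then require a lookup back into the table. If the probe returns a large number of rows and each one needs a back-to-table lookup, the cost becomes high, and making the large table the driven table may no longer be the best choice.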

Experiment

The test uses two tables from the benchmarksql stress test, bmsql_warehouse and bmsql_order_line, initialized with 10 warehouses of data.

mysql> show create table bmsql_warehouse\G
*************************** 1. row ***************************
       Table: bmsql_warehouse
Create Table: CREATE TABLE `bmsql_warehouse` (
  `w_id` int NOT NULL,
  `w_ytd` decimal(12,2) DEFAULT NULL,
  `w_tax` decimal(4,4) DEFAULT NULL,
  `w_name` varchar(10) DEFAULT NULL,
  `w_street_1` varchar(20) DEFAULT NULL,
  `w_street_2` varchar(20) DEFAULT NULL,
  `w_city` varchar(20) DEFAULT NULL,
  `w_state` char(2) DEFAULT NULL,
  `w_zip` char(9) DEFAULT NULL,
  PRIMARY KEY (`w_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

mysql> show create table bmsql_order_line\G
*************************** 1. row ***************************
       Table: bmsql_order_line
Create Table: CREATE TABLE `bmsql_order_line` (
  `ol_w_id` int NOT NULL,
  `ol_d_id` int NOT NULL,
  `ol_o_id` int NOT NULL,
  `ol_number` int NOT NULL,
  `ol_i_id` int NOT NULL,
  `ol_delivery_d` timestamp NULL DEFAULT NULL,
  `ol_amount` decimal(6,2) DEFAULT NULL,
  `ol_supply_w_id` int DEFAULT NULL,
  `ol_quantity` int DEFAULT NULL,
  `ol_dist_info` char(24) DEFAULT NULL,
  PRIMARY KEY (`ol_w_id`,`ol_d_id`,`ol_o_id`,`ol_number`),
  KEY `ol_stock_fkey` (`ol_supply_w_id`,`ol_i_id`),
  KEY `ol_d_id` (`ol_d_id`),
  CONSTRAINT `ol_order_fkey` FOREIGN KEY (`ol_w_id`, `ol_d_id`, `ol_o_id`) REFERENCES `bmsql_oorder` (`o_w_id`, `o_d_id`, `o_id`),
  CONSTRAINT `ol_stock_fkey` FOREIGN KEY (`ol_supply_w_id`, `ol_i_id`) REFERENCES `bmsql_stock` (`s_w_id`, `s_i_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Look at the execution plan and efficiency of the following SQL:

select  * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id
 where a.ol_dist_info like 'a%' and b.w_ytd =300000.00;
mysql> explain analyze select  * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id
    ->  where a.ol_dist_info like 'a%' and b.w_ytd =300000.00;
+--------------------------------------------------------------------+
| EXPLAIN                                                            |
+--------------------------------------------------------------------+
| -> Nested loop inner join  (cost=396352.21 rows=323755) (actual time=11.542..19705.922 rows=115207 loops=1)
    -> Filter: (b.w_ytd = 300000.00)  (cost=1.15 rows=9) (actual time=0.780..0.893 rows=10 loops=1)
        -> Table scan on b  (cost=1.15 rows=9) (actual time=0.743..0.810 rows=10 loops=1)
    -> Filter: (a.ol_dist_info like 'a%')  (cost=12059.95 rows=35973) (actual time=1.401..1969.304 rows=11521 loops=10)
        -> Index lookup on a using ol_d_id (ol_d_id=b.w_id)  (cost=12059.95 rows=323788) (actual time=1.388..1833.176 rows=300209 loops=10)
 |
+--------------------------------------------------------------------+
1 row in set (20.31 sec)

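As the plan shows, the optimizer chose the small table b to drive the large table a. Table b returns 10 rows, so it is the small table. Table a is the driven table: each probe uses the secondary index ol_d_id, reads 300209 index entries, and after going back to the table and applying the filter keeps 11521 rows, so it is the large table. The final result set has 115207 rows. This plan takes about 20 seconds.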
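To see where the 20 seconds go, the figures in the plan can be multiplied out. The following is a small sketch in Python that uses only numbers copied from the EXPLAIN ANALYZE output above; the variable names are illustrative. Keep in mind that the per-loop row counts in EXPLAIN ANALYZE are averages, so the products are approximate.

```python
# Figures copied from the EXPLAIN ANALYZE output above (b drives a).
outer_rows = 10                 # rows from b after the w_ytd filter
index_rows_per_loop = 300209    # entries read from ol_d_id per outer row (average)
kept_rows_per_loop = 11521      # rows surviving the LIKE filter per outer row (average)
inner_ms_per_loop = 1969.304    # actual time of the filter step on a, per loop
join_total_ms = 19705.922       # actual time of the whole nested loop

# ol_dist_info is not stored in the ol_d_id index, so every index entry
# costs one extra lookup in the clustered index before the filter can run.
clustered_lookups = outer_rows * index_rows_per_loop
selectivity = kept_rows_per_loop / index_rows_per_loop
inner_total_ms = outer_rows * inner_ms_per_loop

print(f"clustered-index lookups: {clustered_lookups}")
print(f"rows kept by the filter: {selectivity:.2%}")
print(f"share of join time spent on a: {inner_total_ms / join_total_ms:.1%}")
```

Roughly three million back-to-table lookups are performed, fewer than 4% of the fetched rows survive the filter, and almost all of the elapsed time is spent on the driven table a.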

Using a hint to change the join order

mysql> explain analyze select /*+ join_order(a,b) */  * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id  where a.ol_dist_info like 'a%' and b.w_ytd =300000.00;
+---------------------------------------------------------------------------+
| EXPLAIN                                                                   |
+---------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=408609.87 rows=323755) (actual time=1.374..4696.931 rows=115207 loops=1)
    -> Filter: (a.ol_dist_info like 'a%')  (cost=295295.55 rows=323755) (actual time=1.036..4614.585 rows=115207 loops=1)
        -> Table scan on a  (cost=295295.55 rows=2914088) (actual time=0.937..4275.678 rows=3002091 loops=1)
    -> Filter: (b.w_ytd = 300000.00)  (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=115207)
        -> Single-row index lookup on b using PRIMARY (w_id=a.ol_d_id)  (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=115207)
+----------------------------------------------------------------------------+
1 row in set (4.79 sec)

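As the plan shows, after the join order is changed the large table a drives the small table b. This plan takes a little under 5 seconds (4.79 s), about 76% less than the 20.31 s needed when the small table b drives the large table a. So the small table driving the large table is not always the more efficient choice.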
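The comparison between the two join orders can be checked with a few lines of arithmetic. This is a sketch in Python using only the timings and row counts reported in the two EXPLAIN ANALYZE outputs; the variable names are illustrative.

```python
# Elapsed times reported by the mysql client for the two plans.
b_drives_a_sec = 20.31   # small table drives large table
a_drives_b_sec = 4.79    # large table drives small table (join_order hint)

saved = (b_drives_a_sec - a_drives_b_sec) / b_drives_a_sec
speedup = b_drives_a_sec / a_drives_b_sec

# Row accesses on the large table a in each plan.
random_lookups_b_drives_a = 10 * 300209   # index entries, each with a back-to-table lookup
scanned_rows_a_drives_b = 3002091         # one sequential pass over the clustered index
pk_probes_on_b = 115207                   # single-row primary key lookups on b

print(f"time saved: {saved:.1%}, speedup: {speedup:.2f}x")
print(f"random lookups avoided: {random_lookups_b_drives_a}")
```

Both plans touch about three million rows of a. The difference is that one plan reaches them through a secondary index plus a lookup per row, while the other reads them in a single sequential scan and pays only 115207 cheap primary key probes on b.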

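In fact, this is a two-table join that returns a large amount of data. In MySQL 8.0 (hash join is available from 8.0.18) the optimizer can be steered toward a hash join, which is more efficient than NL for a query like this. Ignoring the indexes on the join columns of both tables makes the optimizer choose a hash join.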

mysql>  explain analyze select  * from bmsql_order_line a ignore index(ol_d_id) join bmsql_warehouse b ignore index(primary) on a.ol_d_id=b.w_id  where a.ol_dist_info like 'a%' and b.w_ytd =300000.00;
+----------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                             |
+-------------------------------------------------------------------------------------------+
| -> Inner hash join (a.ol_d_id = b.w_id)  (cost=295489.08 rows=3997) (actual time=0.428..3586.047 rows=115207 loops=1)
    -> Filter: (a.ol_dist_info like 'a%')  (cost=29634.41 rows=35973) (actual time=0.155..3549.633 rows=115207 loops=1)
        -> Table scan on a  (cost=29634.41 rows=2914088) (actual time=0.133..2747.262 rows=3002091 loops=1)
    -> Hash
        -> Filter: (b.w_ytd = 300000.00)  (cost=1.15 rows=9) (actual time=0.129..0.156 rows=10 loops=1)
            -> Table scan on b  (cost=1.15 rows=9) (actual time=0.123..0.147 rows=10 loops=1)
 |
+----------------------------------------------------------------------------------+
1 row in set (3.67 sec)

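Note: although the official documentation says the BNL and NO_BNL hints can be used to enable and disable hash join, when the join column has an index the optimizer does not evaluate the cost of a hash join and therefore does not choose it. NO_BNL can disable hash join, but BNL cannot strictly force the optimizer to choose it.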
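For readers who want to see why a hash join needs no index on the join column, here is a minimal in-memory sketch in Python. It is not how MySQL implements hash join internally; the function hash_join and the sample rows are made up for illustration, and only the column names come from the tables above.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts: hash the small side, probe with the large side."""
    table = defaultdict(list)
    for row in build_rows:                      # build phase: one pass over the small input
        table[row[build_key]].append(row)
    joined = []
    for row in probe_rows:                      # probe phase: one pass over the large input
        for match in table.get(row[probe_key], ()):
            joined.append({**row, **match})
    return joined


# Made-up sample data, filtered the same way as the query in the article.
warehouses = [{"w_id": 1, "w_ytd": 300000.00}, {"w_id": 2, "w_ytd": 300000.00}]
order_lines = [
    {"ol_d_id": 1, "ol_dist_info": "abc"},
    {"ol_d_id": 2, "ol_dist_info": "axy"},
    {"ol_d_id": 3, "ol_dist_info": "aaa"},      # no matching warehouse
    {"ol_d_id": 1, "ol_dist_info": "zzz"},      # removed by the LIKE 'a%' filter
]

build = [w for w in warehouses if w["w_ytd"] == 300000.00]
probe = [ol for ol in order_lines if ol["ol_dist_info"].startswith("a")]
result = hash_join(build, probe, "w_id", "ol_d_id")
print(len(result))
```

Each input is read once, which matches the shape of the plan above: a single table scan on a, a single table scan on b, and no per-row index lookups.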

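How does execution perform when the large table's join column is served by a covering index, so that no back-to-table lookup is needed?

Look at the execution plan of the SQL below, in which the join column of the large table a is changed.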

mysql> explain analyze select * from bmsql_order_line a  join bmsql_warehouse b on a.ol_w_id=b.w_id  where a.ol_dist_info like 'a%' and b.w_ytd =300000.00;
+--------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                    |
+--------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=494.86 rows=544) (actual time=0.868..4154.968 rows=115207 loops=1)
    -> Filter: (b.w_ytd = 300000.00)  (cost=1.15 rows=9) (actual time=0.387..0.476 rows=10 loops=1)
        -> Table scan on b  (cost=1.15 rows=9) (actual time=0.363..0.417 rows=10 loops=1)
    -> Filter: (a.ol_dist_info like 'a%')  (cost=1.15 rows=60) (actual time=0.119..414.532 rows=11521 loops=10)
        -> Index lookup on a using PRIMARY (ol_w_id=b.w_id)  (cost=1.15 rows=544) (actual time=0.109..385.753 rows=300209 loops=10)
 |
+-------------------------------------------------------------------------------------------+
1 row in set (4.23 sec)

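As the plan shows, the optimizer still chooses the small table b to drive the large table a. The large table, as the driven table, is scanned through its primary key, so no back-to-table lookup is needed. In this example the small table driving the large table and the large table driving the small table take about the same time. Which is more efficient depends mainly on what percentage of the whole table remains after the large table is filtered; different data volumes may call for different choices.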
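The effect of avoiding the back-to-table lookup can be expressed as a cost per row. The sketch below, in Python, divides the per-loop timings of the two "Index lookup on a" steps (via ol_d_id earlier, via PRIMARY here) by the per-loop row count. The numbers are taken from the plans in this article and describe this one run only.

```python
rows_per_loop = 300209    # rows returned per outer row in both plans

secondary_ms = 1833.176   # index lookup on a using ol_d_id, per loop (needs back-to-table)
primary_ms = 385.753      # index lookup on a using PRIMARY, per loop (no back-to-table)

us_secondary = secondary_ms * 1000 / rows_per_loop   # microseconds per row
us_primary = primary_ms * 1000 / rows_per_loop
ratio = us_secondary / us_primary

print(f"via ol_d_id: {us_secondary:.2f} us/row")
print(f"via PRIMARY: {us_primary:.2f} us/row")
print(f"ratio: {ratio:.2f}")
```

For the same number of rows, reaching them through the secondary index costs several times more per row than reading them straight from the primary key, which accounts for most of the gap between the 20-second plan and this 4-second plan.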

Summary

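MySQL 8.0 has two join methods. Whether to choose NL or hash join depends on whether the join returns a small or a large amount of data: in general, NL beats hash join for small results, and hash join beats NL for large ones.

If NL is the only option (versions below MySQL 8.0), whether the small table driving the large table or the large table driving the small table is faster depends on whether the index used to join the large table is a covering index, and on how much data the join returns.

When the large table is joined through a secondary index, the join returns a large amount of data, and a back-to-table lookup is required, having the large table drive the small table is generally more efficient. When the join returns a small amount of data, having the small table drive the large table is generally more efficient.

When the large table is joined through a covering index, the outcome depends on what percentage of the whole table remains after the large table is filtered; different data volumes may call for different choices.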
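These rules of thumb can be turned into a rough estimate. The sketch below, in Python, is a toy cost model, not the MySQL optimizer's cost model. The per-row costs are assumptions derived from the EXPLAIN ANALYZE timings in this article, so they apply to this data set and machine only, and the function names are hypothetical.

```python
# Per-row costs in microseconds, derived from the timings in this article.
US_SECONDARY_PLUS_LOOKUP = 6.1   # 1833.176 ms / 300209 rows via ol_d_id
US_CLUSTERED_RANGE = 1.28        # 385.753 ms / 300209 rows via PRIMARY
US_SCAN_AND_FILTER = 1.54        # 4614.585 ms / 3002091 rows, full scan of a
US_PK_POINT_LOOKUP = 0.7         # (4696.931 - 4614.585) ms / 115207 lookups on b


def small_drives_large(outer_rows, rows_per_key, us_per_row):
    """Estimated seconds when each outer row probes the large table's index."""
    return outer_rows * rows_per_key * us_per_row / 1e6


def large_drives_small(large_rows, kept_rows):
    """Estimated seconds for a full scan of the large table plus PK probes."""
    return (large_rows * US_SCAN_AND_FILTER + kept_rows * US_PK_POINT_LOOKUP) / 1e6


# The article's data: 10 outer rows, about 300209 matches per key.
many_secondary = small_drives_large(10, 300209, US_SECONDARY_PLUS_LOOKUP)
many_clustered = small_drives_large(10, 300209, US_CLUSTERED_RANGE)
scan_first = large_drives_small(3002091, 115207)

# Hypothetical variant: only 3000 rows match each key, same table size.
few_secondary = small_drives_large(10, 3000, US_SECONDARY_PLUS_LOOKUP)
scan_first_few = large_drives_small(3002091, 1150)

print(f"many matches, secondary index: {many_secondary:.1f} s")
print(f"many matches, primary key:     {many_clustered:.1f} s")
print(f"many matches, scan a first:    {scan_first:.1f} s")
print(f"few matches, secondary index:  {few_secondary:.2f} s")
print(f"few matches, scan a first:     {scan_first_few:.1f} s")
```

With many matches per key, scanning the large table first beats probing it through the secondary index. With few matches per key the order reverses, because the full scan still has to read the whole table while the index probes touch only a small part of it.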

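Do not try to memorize these conclusions. Once you understand in depth how tables are joined and scanned, and how a SQL statement is executed, everything falls into place, and you will be able to judge clearly which execution plan is more efficient for a given statement. If the optimizer makes a wrong decision, you can try the various optimization techniques available to influence its choice.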

Enjoy GreatSQL