GreatSQL優化技巧：半連接(semijoin)優化

何為半連接？半連接是在GreatSQL內部採用的一種執行子查詢的方式，semi join不是語法關鍵字，不能像使用inner join、left join、right join這種語法關鍵字一樣提供給用戶來編寫SQL語句。兩個表t1表和t2表進行半連接的含義是：對於t1表的某條記錄來說，我們只關 ...

何為半連接？

半連接是在GreatSQL內部採用的一種執行子查詢的方式，semi join不是語法關鍵字，不能像使用inner join、left join、right join這種語法關鍵字一樣提供給用戶來編寫SQL語句。

兩個表t1表和t2表進行半連接的含義是：對於t1表的某條記錄來說，我們只關心在t2表中是否存在與之匹配的記錄，而不關心有多少條記錄與之匹配，最終的結果集中只保留t1表的記錄。

前面文章也提到過，含in、exists子查詢的語句通常會採用半連接方式執行查詢，但這不絕對，也有一些情況不適用半連接。比如:

(1)外查詢的where子句中，存在其他搜索條件使用OR操作符與IN子查詢的條件連接起來

(2)IN子查詢位於Select子句中

(3)IN子查詢中含有union的情況

(4)IN子查詢中含group by、having或聚合函數的情況

GreatSQL執行半連接的優化策略

本文實驗使用資料庫版本為GreatSQL 8.0.32-25。

創建兩張實驗表來說明。

greatsql> create table t1(
c1 varchar(30),
c2 int
);
greatsql> create table t2(
id int primary key,
c1 varchar(30),
key idx_c1(c1)
);
--插入幾條測試數據
greatsql> insert into t1 values('a',1);
greatsql> insert into t1 values('b',3);
greatsql> insert into t1 values('a',5);
greatsql> insert into t1 values('c',7);
greatsql> insert into t1 values('d',9);
greatsql> insert into t2 values(1,'a');
greatsql> insert into t2 values(2,'a');
greatsql> insert into t2 values(3,'b');
greatsql> insert into t2 values(4,'b');
greatsql> insert into t2 values(5,'c');
greatsql> insert into t2 values(6,'b');

GreatSQL執行半連接的方式大致有以下5種：

1.Table pullout（子查詢中的表上拉）

當子查詢的查詢列表處只有主鍵或者唯一索引列時，可以直接把子查詢中的表上拉到外層查詢的FROM子句中，並把子查詢的查詢條件合併到外層查詢的搜索條件中。所以選擇這種方式是有先決條件的，子查詢的查詢列表處必須只有主鍵或唯一索引列。有沒有選擇這種方式，可以通過執行explain展示計劃後，使用show warnings命令查看優化器改寫後的語句。

例如下麵這個語句：

select * from t1 where c2 in (select id from t2 where t2.c1='b');

這個語句種子查詢的id列是t2表的主鍵列，滿足這種方式的先決條件，看一下執行計劃。

greatsql> explain select * from t1 where c2 in (select id from t2 where t2.c1='b');
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t2` join `test`.`t1` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

從warning信息可以看出，優化器改執行連接方式是，t1表與t2表通過內連接來關聯，原子查詢內部t2表的過濾條件放到了整個語句where條件的後面，原語句與優化器執行的語句之所以等價，是因為子查詢的查詢列id列是主鍵列，不會有重覆值，跟外表t1使用inner join連接後，不會造成關聯後結果集數據量的放大。一般情況下子查詢的查詢列表處只有主鍵或者唯一索引列時都會轉化為這種方式來執行。對於這種業務，無論開發者怎麼編寫SQL，使用inner join 也好，exists也好，最後優化器執行方式可能都是一樣的。

可以看一下將原語句改造為inner join 與 exists語句的執行計劃，是不是都是一樣的。

greatsql> explain select * from t1 where exists (select 1 from t2 where t2.id=t1.c2 and t2.c1='b');
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 2 warnings (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1276 | Field or reference 'test.t1.c2' of SELECT #2 was resolved in SELECT #1                                                                                                                           |
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t2` join `test`.`t1` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

greatsql> explain select t1.* from t1 inner join t2 on t1.c2=t2.id where t2.c1='b';
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` join `test`.`t2` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

這種執行方式本質上已經轉換為內連接了。

2.FirstMatch（首次匹配）

這種方式先取外層查詢的一條記錄，到子查詢的表中尋找符合匹配條件的記錄，如果能找到一條，則將外層查詢的記錄放入到最終結果集中並且停止查找匹配更多的記錄，如果找不到，則把該外層查詢的記錄丟棄掉，然後再開始取下一條外層查詢中的記錄，這個過程一直持續到外層查詢獲取不到記錄為止。

看一個簡單語句的執行計劃

select * from t1 where c1 in (select c1 from t2);

greatsql> explain select * from t1 where c1 in (select c1 from t2);
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
| id | select_type | table | partitions | type | possible_keys | key    | key_len | ref          | rows | filtered | Extra                         |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL          | NULL   | NULL    | NULL         |    4 |   100.00 | Using where                   |
|  1 | SIMPLE      | t2  | NULL       | ref  | idx_c1        | idx_c1 | 123     | test.t1.c1 |    2 |   100.00 | Using index; FirstMatch(t1) |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                  |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t2`.`c1` = `test`.`t1`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從warning信息可以看到 semi join 的字樣，優化器使用半連接方式執行的子查詢。從執行計劃可以看到 extra 列有FirstMatch(t1) 的字樣，表示對t1表外查詢傳入的每個c1值在t2表上都進行了首次匹配，這種方式也是我最初理解的in子查詢的含義，只關心有無匹配上，不關心匹配上多少。

3.LooseScan（鬆散掃描）

LooseScan是使用子查詢的查詢列上的索引，只針對相同索引列值的第一條記錄，去外查詢找對應的記錄。使用了這種優化方式的半連接，在explain的計劃的Extra列會有LooseScan字樣。

還是上面的語句，使用semijoin的hint干涉優化器，使其選擇LooseScan的優化策略。

select /*+ semijoin(@subq1 loosescan) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );

greatsql> explain select /*+ semijoin(@subq1 loosescan) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );            
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key    | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | index | idx_c1        | idx_c1 | 123     | NULL |    6 |    50.00 | Using index; LooseScan                     |
|  1 | SIMPLE      | t1  | NULL       | ALL   | NULL          | NULL   | NULL    | NULL |    5 |    20.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                      |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` LOOSESCAN) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t1`.`c1` = `test`.`t2`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從執行計劃可以看出，子查詢的表t2作為驅動表，t2表的c1列上有索引，對錶t2進行訪問時，使用其c1列的索引，對相同的索引列值只取第一條記錄去t1表中找對應記錄，將所有外查詢表t1對應的記錄都加入到最終結果集，可以理解為對子查詢t2表的索引掃描方式是跳躍式的。

4.Duplicate Weedout重覆值消除

這種方式是藉助臨時表來消除重覆值，explain展示計劃時，在extra列會出現Start temporary 和 End temporary的字樣。

還是上面的語句，我們使用semijoin的hint干涉優化器，使其選擇dupsweedout優化策略。

greatsql> explain select /*+ semijoin(@subq1 dupsweedout)*/ * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2);
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key    | key_len | ref          | rows | filtered | Extra                                       |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL          | NULL   | NULL    | NULL         |    4 |   100.00 | Using where                                 |
|  1 | SIMPLE      | t2  | NULL       | ref  | idx_c1        | idx_c1 | 123     | test.t1.c1 |    2 |   100.00 | Using index; Start temporary; End temporary |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                        |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` DUPSWEEDOUT) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t2`.`c1` = `test`.`t1`.`c1`) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

例如：t1表的記錄（'b',3）,可以匹配上t2表的兩條記錄(3,'b'),(4,'b')，為了消除關聯結果的重覆值，可以想象建立這樣一個臨時表：

create table tmp(rowid int primary key);

當把t1表的記錄加入到結果集時，先把這條記錄的rowid加入到臨時表中，如果添加成功，說明這條記錄並沒有加入到最後的結果集，如果添加失敗，則說明t1表的這條記錄已經加入到最終結果集了

個人感覺這種方式比其他方式效率低。

5.Semi-join Materialization（半連接物化）

先把IN 子句中的不相關子查詢進行物化，然後再將外層查詢的表與物化表進行連接。子查詢內部有分組聚合運算時通常會先進行物化處理。

還是上面的語句，使用semijoin的hint干涉優化器，使其選擇materialization的優化策略。

select /*+ semijoin(@subq1 materialization) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );

greatsql> explain select /*+ semijoin(@subq1 materialization) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref          | rows | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
|  1 | SIMPLE       | t1        | NULL       | ALL    | NULL                | NULL                | NULL    | NULL         |    5 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 123     | test.t1.c1 |    1 |   100.00 | NULL        |
|  2 | MATERIALIZED | t2        | NULL       | index  | idx_c1              | idx_c1              | 123     | NULL         |    6 |   100.00 | Using index |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                            |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` MATERIALIZATION) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`<subquery2>`.`c1` = `test`.`t1`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從執行計劃可以看出，先對子查詢t2表做了物化表處理，物化表會生成自動索引<auto_distinct_key>，外查詢表t1再與物化表做Nest loop連接。

補充說明

對於上面的語句 select * from t1 where c1 in (select c1 from t2);，優化器預設選擇了firstmatch方式，其他方式都是使用hint來干涉的優化器的選擇，可以看到這個hint包含兩部分，一個是使用qb_name()給子查詢分配一個名稱，一個是使用semijoin([@query_block_name] [strategy]),指定子查詢塊使用半連接策略，可以指定多個策略。同時semijoin的優化策略的選擇還受優化開關參數optimize_switch的影響，該參數里有semijoin，loosescan，firstmatch，duplicateweedout的開關，預設都是開啟的，所以也可以使用優化開關來干涉優化器的選擇。

優化舉例

select count(*)
  from t1 a
 where substr(a.modifytime, 1, 8) = '20240301'
   and a.sospecnumber in
       (select a.sospecnumber
          from t1 a
         where substr(a.modifytime, 1, 8) < '20240301');

這條SQL只涉及一張表t1，表中數據200萬左右，modify_time為字元類型，存儲從2009年開始的時間串。看一下該表的索引情況。

greatsql> show index from t1;
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| t1    |          1 | idx_sospecnumber |            1 | SOSPECNUMBER | A         |         133 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| t1    |          1 | idx_modifytime   |            1 | MODIFYTIME   | A         |      634186 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)

explain的執行計劃如下：

greatsql> explain
    -> select count(*)
    ->   from t1 a
    ->  where substr(a.modifytime, 1, 8) ='20240301'
    ->    and a.sospecnumber  in
    ->        (select a.sospecnumber
    ->           from t1 a
    ->          where substr(a.modifytime, 1, 8) < '20240301') ;
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref                 | rows    | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
|  1 | SIMPLE       | a           | NULL       | ALL    | idx_sospecnumber    | NULL                | NULL    | NULL                | 2426414 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 131     | test.a.SOSPECNUMBER |       1 |   100.00 | NULL        |
|  2 | MATERIALIZED | a           | NULL       | ALL    | idx_sospecnumber    | NULL                | NULL    | NULL                | 2426414 |   100.00 | Using where |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

優化器選擇的半連接優化策略是物化的方式。

explain analyze的實際計劃如下：

greatsql> explain analyze
    -> select count(*)
    ->   from t1 a
    ->  where substr(a.modifytime, 1, 8) ='20240301'
    ->    and a.sospecnumber  in
    ->        (select a.sospecnumber
    ->           from t1 a
    ->          where substr(a.modifytime, 1, 8) < '20240301') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=1177497474524.58 rows=1) (actual time=4442.499..4442.500 rows=1 loops=1)
    -> Nested loop inner join  (cost=588748984584.98 rows=5887484899396) (actual time=4438.967..4442.408 rows=1346 loops=1)
        -> Filter: ((substr(a.MODIFYTIME,1,8) = '20240301') and (a.SOSPECNUMBER is not null))  (cost=252003.98 rows=2426414) (actual time=1550.096..1552.027 rows=1346 loops=1)
            -> Table scan on a  (cost=252003.98 rows=2426414) (actual time=0.050..1189.136 rows=2493198 loops=1)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (sospecnumber=a.SOSPECNUMBER)  (cost=494645.48..494645.48 rows=1) (actual time=2.147..2.147 rows=1 loops=1346)
            -> Materialize with deduplication  (cost=494645.38..494645.38 rows=2426414) (actual time=2888.845..2888.845 rows=165 loops=1)
                -> Filter: (a.SOSPECNUMBER is not null)  (cost=252003.98 rows=2426414) (actual time=0.215..1927.315 rows=2487547 loops=1)
                    -> Filter: (substr(a.MODIFYTIME,1,8) < '20240301')  (cost=252003.98 rows=2426414) (actual time=0.214..1745.562 rows=2487547 loops=1)
                        -> Table scan on a  (cost=252003.98 rows=2426414) (actual time=0.211..1235.738 rows=2493198 loops=1)

1 row in set (4.45 sec)

優化分析：

這條SQL總體耗時4.45s，耗時主要分佈在兩處：

一處消耗在外表的查詢，對t1進行了全表掃描，回表過濾後剩餘1346行數據，耗時1552ms，此處雖然modifytime列有索引，但是因為在條件列上施加了substr函數，導致索引用不上，改為modifytime like '20240301%'的方式，也表示了查詢2024年3月1日的數據，同時用上了索引。

另一處消耗在子查詢的物化上，子查詢結果集有2487547行數據，表掃描、過濾、物化整個過程耗時約2888ms，對大結果集進行物化消耗比較大，同時IN子查詢的查詢列sospecnumber列上是有索引的，雖然選擇性不好，但是這個子查詢的含義是只需要判斷子查詢結果集中有無記錄能匹配上，而不關心匹配上多少條，所以這種情況採用first match方式比較好。

SQL改寫如下：

select /*+ semijoin(@subq firstmatch)*/
 count(*)
  from t1 a
 where a.modifytime like '20240301%'
   and a.sospecnumber in
       (select /*+ qb_name(subq)*/
         a.sospecnumber
          from t1 a
         where substr(a.modifytime, 1, 8) < '20240301')

改寫後執行計劃如下：

*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=11052513.72 rows=1) (actual time=157.570..157.570 rows=1 loops=1)
    -> Nested loop semijoin  (cost=8596909.70 rows=24556040) (actual time=0.203..157.450 rows=1346 loops=1)
        -> Filter: (a.SOSPECNUMBER is not null)  (cost=606.05 rows=1346) (actual time=0.057..7.610 rows=1346 loops=1)
            -> Index range scan on a using idx_modifytime over ('20240301' <= MODIFYTIME <= '20240301????????????????????????????????????????????????'), with index condition: (a.MODIFYTIME like '20240301%')  (cost=606.05 rows=1346) (actual time=0.055..7.406 rows=1346 loops=1)
        -> Filter: (substr(a.MODIFYTIME,1,8) < '20240301')  (cost=83255911.06 rows=18244) (actual time=0.111..0.111 rows=1 loops=1346)
            -> Index lookup on a using idx_sospecnumber (SOSPECNUMBER=a.SOSPECNUMBER)  (cost=83255911.06 rows=18244) (actual time=0.111..0.111 rows=1 loops=1346)

1 row in set, 1 warning (0.16 sec)

改寫後耗時0.16s，性能提升近30倍，在對子查詢通過索引idx_sospecnumber搜索數據時，查到一條就會停止繼續搜索了。

結語

GreatSQL的 IN 子查詢適用於半連接時，優化器提供了5種優化策略：Table pullout、FirstMatch、LooseScan、Duplicate weedout、materialize。

一般外查詢表結果集小，子查詢結果集太大時，不希望通過物化這種方式來執行連接，因為物化表的代價太大，可能通過FirstMatch或者LooseScan很快就可以執行出結果了。那反之外查詢結果集大，子查詢結果集小時，通過物化表這種方式可能就會取得很好的效果。很多時候都不用過多干涉優化器做選擇，但是如果懂得原理，當優化器選錯的時候我們也可以通過hint來穩定計劃，讓SQL保持高效的執行。

Enjoy GreatSQL