GreatSQL 優化技巧：將 MINUS 改寫為標量子查詢前言 minus 指令運用在兩個 SQL 語句上，取兩個語句查詢結果集的差集。它先找出第一個 SQL 所產生的結果，然後看這些結果有沒有在第二個 SQL 的結果中，如果在，那這些數據就被去除，不會在最後的結果中出現，第二個 SQL 結果集 ...

GreatSQL 優化技巧：將 MINUS 改寫為標量子查詢

前言

minus 指令運用在兩個 SQL 語句上，取兩個語句查詢結果集的差集。它先找出第一個 SQL 所產生的結果，然後看這些結果有沒有在第二個 SQL 的結果中，如果在，那這些數據就被去除，不會在最後的結果中出現，第二個 SQL 結果集比第一個SQL結果集多的數據也會被拋棄。
這兩天的優化工作中遇到這樣一種案例，第一個SQL語句結果集很小，第二個SQL語句結果集很大，這種情況下我們怎麼來優化處理呢？

實驗

創建測試表

CREATE TABLE t1(id int primary key auto_increment,
subscriber_id decimal(20, 0) not null,
member_num varchar(20) not null,
effectdate datetime,
expirydate datetime,
create_date datetime,
key idx_subscriber(subscriber_id));

創建存儲過程，向測試插入50萬數據。（實際生產案例中表中數據有幾千萬）

註意下麵的存儲過程中，是GreatSQL在Oracle模式下創建的，GreatSQL實現了大量的Oracle語法相容，比如下麵存儲過程中遇到的日期加減，add_months函數，while loop迴圈等，資料庫由Oracle向GreatSQL遷移時，會節省很多代碼改造工作。

set sql_mode=oracle;
delimiter //
create or replace procedure p1() as
 p1 int :=1;
 n1 int;
 d1 datetime;
begin
   while p1<=500000 loop
       n1:=round(rand()*500000);
       d1:=to_date('2016-01-01','yyyy-mm-dd') + round(rand()*3000);
       insert into t1(subscriber_id,member_num,effectdate,expirydate,create_date) values(n1,concat('m_',n1),last_day(d1)+1,add_months(last_day(d1)+1,100),d1);
       set p1=p1+1;
   end loop;
end;
//
delimiter ;

這個表create_date列的數據是從2016年1月到2024年3月的數據，使用了隨機值，保證每個月的數據量相近，subscriber_id也是隨機值生成的，選擇性很好，這個模型數據與生產環境差不多。

執行下麵這個SQL語句：

SELECT DISTINCT subscriber_id, member_num
  FROM t1 
 WHERE create_date >= '2024-02-01'
   AND create_date < '2024-03-01'
   AND to_char(effectdate, 'yyyymm') > '202402'
minus
SELECT DISTINCT subscriber_id, member_num
  FROM t1 
 WHERE 202402 BETWEEN to_char(effectdate, 'yyyymm') AND
       to_char(expirydate, 'yyyymm');

這條SQL是根據生產環境使用的語句簡化而來的，只為突出本文要說明的知識點。

此SQL的執行計劃如下：

greatsql> explain analyze
    -> select distinct subscriber_id, member_num
    ->   from t1
    ->  where create_date >= '2024-02-01'
    ->    and create_date < '2024-03-01'
    ->    and to_char(effectdate, 'yyyymm') > '202402'
    -> minus
    -> select distinct subscriber_id, member_num
    ->   from t1
    ->  where 202402 between to_char(effectdate, 'yyyymm') and
    ->        to_char(expirydate, 'yyyymm')\G
*************************** 1. row ***************************
EXPLAIN: -> Table scan on <except temporary>  (cost=168492.31..169186.99 rows=55375) (actual time=2420.123..2420.896 rows=1758 loops=1)
    -> Except materialize with deduplication  (cost=168492.30..168492.30 rows=55375) (actual time=2420.121..2420.121 rows=4855 loops=1)
        -> Table scan on <temporary>  (cost=55858.24..56552.91 rows=55375) (actual time=221.965..223.384 rows=4855 loops=1)
            -> Temporary table with deduplication  (cost=55858.23..55858.23 rows=55375) (actual time=221.962..221.962 rows=4855 loops=1)
                -> Filter: ((t1.create_date >= TIMESTAMP'2024-02-01 00:00:00') and (t1.create_date < TIMESTAMP'2024-03-01 00:00:00') and (to_char(t1.effectdate,'yyyymm') > '202402'))  (cost=50320.70 rows=55375) (actual time=0.118..217.497 rows=4875 loops=1)
                    -> Table scan on t1  (cost=50320.70 rows=498477) (actual time=0.084..179.826 rows=500000 loops=1)
        -> Table scan on <temporary>  (cost=100168.41..106401.86 rows=498477) (actual time=1520.965..1571.682 rows=307431 loops=1)
            -> Temporary table with deduplication  (cost=100168.40..100168.40 rows=498477) (actual time=1520.963..1520.963 rows=307431 loops=1)
                -> Filter: (202402 between to_char(t1.effectdate,'yyyymm') and to_char(t1.expirydate,'yyyymm'))  (cost=50320.70 rows=498477) (actual time=0.123..934.617 rows=492082 loops=1)
                    -> Table scan on t1  (cost=50320.70 rows=498477) (actual time=0.104..716.919 rows=500000 loops=1)

1 row in set (2.47 sec)

從執行計劃看出，SQL總體耗時2.47s。第一部分的查詢結果集有4855條，耗時221.962ms，第二部分的查詢結果集有307431條，耗時1571.682ms。

優化分析：

首先第一部分create_date加上索引會提升查詢效率，因為只需要查詢一個月的數據，而此SQL耗時最多的是第二部分，重在第二部分的優化處理。

第二部分查詢結果集在做minus運算時大部分記錄都是要被拋棄的，查詢出來再被拋棄相當於做了無用功，而SQL優化的核心思想就是在於減少IO，那我們要做的就是想辦法省去第二部分SQL的全面查詢，只需要驗證第一部分的查詢結果集是否在第二部分查詢結果中存在就好了。

那如何驗證呢?

把第一部分select的列值傳給第二部分作為where條件去查找，只要能查到，無論幾條都算在第二部分存在，這部分數據就要被去除，查不到就是在第二部分不存在，數據保留在最終結果集。根據這個邏輯我想到了標量子查詢的妙用。

標量子查詢改寫參考：

select distinct subscriber_id, member_num
  from (select a.subscriber_id,
               a.member_num,
               (select count(*) cnt
                  from t1 b
                 where a.subscriber_id = b.subscriber_id
                   and a.member_num = b.member_num
                   and 202402 between to_char(effectdate, 'yyyymm') and
                       to_char(expirydate, 'yyyymm')) as cnt
          from t1 a
         where create_date >= '2024-02-01'
           and create_date < '2024-03-01'
           and to_char(effectdate, 'yyyymm') > '202402')
 where cnt = 0

改後SQL的執行計劃如下：

greatsql> explain analyze
    -> select distinct subscriber_id, member_num
    ->   from (select a.subscriber_id,
    ->                a.member_num,
    ->                (select count(*) cnt
    ->                   from t1 b
    ->                  where a.subscriber_id = b.subscriber_id
    ->                    and a.member_num = b.member_num
    ->                    and 202402 between to_char(effectdate, 'yyyymm') and
    ->                        to_char(expirydate, 'yyyymm')) as cnt
    ->           from t1 a
    ->          where create_date >= '2024-02-01'
    ->            and create_date < '2024-03-01'
    ->            and to_char(effectdate, 'yyyymm') > '202402')
    ->  where cnt = 0\G
*************************** 1. row ***************************
EXPLAIN: -> Table scan on <temporary>  (cost=3172.53..3235.95 rows=4875) (actual time=168.555..168.775 rows=1758 loops=1)
    -> Temporary table with deduplication  (cost=3172.51..3172.51 rows=4875) (actual time=168.553..168.553 rows=1758 loops=1)
        -> Index lookup on alias_temp_-1556603461854822391 using <auto_key0> (cnt=0)  (cost=2681.86..2685.01 rows=10) (actual time=166.656..167.178 rows=1765 loops=1)
            -> Materialize  (cost=2681.51..2681.51 rows=4875) (actual time=166.649..166.649 rows=4875 loops=1)
                -> Filter: (to_char(a.effectdate,'yyyymm') > '202402')  (cost=2194.01 rows=4875) (actual time=0.380..45.477 rows=4875 loops=1)
                    -> Index range scan on a using idx_creatdate over ('2024-02-01 00:00:00' <= create_date < '2024-03-01 00:00:00'), with index condition: ((a.create_date >= TIMESTAMP'2024-02-01 00:00:00') and (a.create_date < TIMESTAMP'2024-03-01 00:00:00'))  (cost=2194.01 rows=4875) (actual time=0.344..43.143 rows=4875 loops=1)
                -> Select #3 (subquery in projection; dependent)
                    -> Aggregate: count(0)  (cost=0.42 rows=1) (actual time=0.022..0.022 rows=1 loops=4875)
                        -> Filter: ((a.member_num = b.member_num) and (202402 between to_char(b.effectdate,'yyyymm') and to_char(b.expirydate,'yyyymm')))  (cost=0.40 rows=0.2) (actual time=0.019..0.021 rows=1 loops=4875)
                            -> Index lookup on b using idx_subscriber (subscriber_id=a.subscriber_id)  (cost=0.40 rows=2) (actual time=0.018..0.019 rows=2 loops=4875)

1 row in set, 2 warnings (0.26 sec)

從執行計劃可以看出，子查詢執行次數依賴於主查詢，執行了4875次，因為subscriber_id列選擇性很好，所以每次查詢效率很高。SQL總體耗時0.26秒，而原SQL耗時2.47s，性能提升了將近10倍。在實際生產案例中第二部分結果集有5000萬左右，第一部分結果集只有幾十條，SQL執行半天都跑不出結果，改造後幾乎秒出。

提醒一點，註意NULL值比較，當select 列表中的部分列存在NULL值時就不能直接用等號（=）關聯來判斷了，得用is NULL來判斷，本案例不涉及此問題，語句是否等價有時需要結合業務，具體情況具體分析。

結論：

本文提供了一種minus語句的優化方法，將minus轉化為標量子查詢表達，這種優化方式適用於第一部分查詢結果集比較小，查詢的列比較少的情況，且要結合業務判讀是否需要對NULL值進行判斷。優化時一般避免使用標量子查詢，因為標量子查詢會構造天然的嵌套迴圈連接，但也並不是說標量子查詢一定不可用，還是要從根兒上考慮，優化核心思想，減少IO是要點。

Enjoy GreatSQL