Understanding Common Optimization Methods for Fuzzy Queries in Data Warehouses

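Abstract: This article explains common performance optimizations for fuzzy queries on GaussDB(DWS). Creating suitable indexes can speed up fuzzy query statements in several scenarios.

This article is shared from the Huawei Cloud community post 《GaussDB(DWS) 模糊查詢性能優化》 ("GaussDB(DWS) Fuzzy Query Performance Optimization"), by 黎明的風.

When using GaussDB(DWS), fuzzy queries written with like sometimes run slowly.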

(1) LIKE Fuzzy Queries

A typical query looks like this:

select * from t1 where c1 like 'A123%';

When table t1 holds a large amount of data, this like query is very slow.

Use explain to view the query plan generated for the statement:

test=# explain select * from t1 where c1 like 'A123%';
                                 QUERY PLAN
-----------------------------------------------------------------------------
  id |          operation           | E-rows | E-memory | E-width | E-costs
 ----+------------------------------+--------+----------+---------+---------
   1 | ->  Streaming (type: GATHER) |      1 |          |       8 | 16.25
   2 |    ->  Seq Scan on t1        |      1 | 1MB      |       8 | 10.25
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Seq Scan on t1
         Filter: (c1 ~~ 'A123%'::text)

The plan shows a full table scan (Seq Scan) on t1, which is why the statement is slow when t1 holds a lot of data.
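
To reproduce this plan on a test system, something like the following sketch may help. The article does not show the definition of t1, so the column list and the generated values here are assumptions made purely for illustration.

```sql
-- Hypothetical test table; the real definition of t1 is not given in the article.
create table t1 (id int, c1 text);

-- Fill it with synthetic values such as 'A1', 'A2', ... (assumed data).
insert into t1
select i, 'A' || i::text
from generate_series(1, 100000) as g(i);

-- Refresh statistics so the planner's row estimates are meaningful.
analyze t1;

-- With no index on c1, the plan is expected to show a Seq Scan with a Filter on c1.
explain select * from t1 where c1 like 'A123%';
```

The row counts and costs will differ from the plan shown above, since they depend on the data and the cluster.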

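The pattern 'A123%' in the query above has its wildcard at the end, so it matches on a fixed prefix; we call this a trailing-wildcard match. In this scenario, a B-tree index can improve query performance.

When creating the index, set the operator class that matches the column's data type: text_pattern_ops for text, varchar_pattern_ops for varchar, and bpchar_pattern_ops for char.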
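
As a minimal sketch of that type-to-operator-class mapping, the statements below create one index per character type. The table t2, its columns, and the index names are hypothetical and are not part of the original example.

```sql
-- Hypothetical table with one column of each character type.
create table t2 (a text, b varchar(32), c char(16));

-- text column: text_pattern_ops
create index t2_a_idx on t2 (a text_pattern_ops);

-- varchar column: varchar_pattern_ops
create index t2_b_idx on t2 (b varchar_pattern_ops);

-- char (bpchar) column: bpchar_pattern_ops
create index t2_c_idx on t2 (c bpchar_pattern_ops);
```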

In the example above, column c1 is of type text, so the index is created with text_pattern_ops:

CREATE INDEX ON t1 (c1 text_pattern_ops);

After adding the index, print the query plan again:

test=# explain select * from t1 where c1 like 'A123%';
                                       QUERY PLAN
----------------------------------------------------------------------------------------
  id |                operation                | E-rows | E-memory | E-width | E-costs
 ----+-----------------------------------------+--------+----------+---------+---------
   1 | ->  Streaming (type: GATHER)            |      1 |          |       8 | 14.27
   2 |    ->  Index Scan using t1_c1_idx on t1 |      1 | 1MB      |       8 | 8.27
             Predicate Information (identified by plan id)
 ----------------------------------------------------------------------
   2 --Index Scan using t1_c1_idx on t1
         Index Cond: ((c1 ~>=~ 'A123'::text) AND (c1 ~<~ 'A124'::text))
         Filter: (c1 ~~ 'A123%'::text)

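With the index in place, the plan shows that the statement now uses it (Index Scan using t1_c1_idx): the like condition is turned into the range condition listed under Index Cond, and the statement runs faster.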
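
The Index Cond line suggests how the prefix match is evaluated: as a range whose upper bound is the prefix with its last character incremented. The sketch below writes that range out by hand with the same operators that appear in the plan; it is only meant to make the transformation visible, not as a recommended way to write the query.

```sql
-- Written as a pattern match:
select * from t1 where c1 like 'A123%';

-- The range form from the Index Cond line: every string that starts with
-- 'A123' sorts at or after 'A123' and before 'A124' in byte order.
select * from t1
where c1 ~>=~ 'A123'
  and c1 ~<~  'A124';
```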

The query above used a trailing wildcard. If the wildcard is at the front of the pattern instead, we can check whether the query plan still uses the index.

test=# explain select * from t1 where c1 like '%A123';
                                 QUERY PLAN
-----------------------------------------------------------------------------
  id |          operation           | E-rows | E-memory | E-width | E-costs
 ----+------------------------------+--------+----------+---------+---------
   1 | ->  Streaming (type: GATHER) |      1 |          |       8 | 16.25
   2 |    ->  Seq Scan on t1        |      1 | 1MB      |       8 | 10.25
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Seq Scan on t1
         Filter: (c1 ~~ '%A123'::text)

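As the plan above shows, once the condition becomes a leading-wildcard match, the index created earlier cannot be used, and the query performs a full table scan.

In this case, we can use the reverse function to build an index that supports leading-wildcard queries: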

CREATE INDEX ON t1 (reverse(c1) text_pattern_ops);

After rewriting the query condition with the reverse function, the query plan is:

test=# explain select * from t1 where reverse(c1) like 'A123%';
                                        QUERY PLAN
------------------------------------------------------------------------------------------
  id |           operation           | E-rows | E-memory | E-width | E-costs
 ----+-------------------------------+--------+----------+---------+---------
   1 | ->  Streaming (type: GATHER)  |      5 |          |       8 | 14.06
   2 |    ->  Bitmap Heap Scan on t1 |      5 | 1MB      |       8 | 8.06
   3 |       ->  Bitmap Index Scan   |      5 | 1MB      |       0 | 4.28
                      Predicate Information (identified by plan id)
 ----------------------------------------------------------------------------------------
   2 --Bitmap Heap Scan on t1
         Filter: (reverse(c1) ~~ 'A123%'::text)
   3 --Bitmap Index Scan
         Index Cond: ((reverse(c1) ~>=~ 'A123'::text) AND (reverse(c1) ~<~ 'A124'::text))

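After the rewrite, the statement can use the index, and query performance improves. Note that the pattern text must be reversed as well: the rewrite of c1 like '%A123' is reverse(c1) like '321A%'. The condition reverse(c1) like 'A123%' shown above matches values of c1 that end in '321A'.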
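
A short sketch of the rewrite may make the pattern handling clearer. It assumes the reverse(c1) index created above; the search text 'A123' is simply carried over from the earlier examples.

```sql
-- Original leading-wildcard query (cannot use the plain index on c1):
--   select * from t1 where c1 like '%A123';

-- Rewrite: reverse the column AND the pattern text.
-- 'A123' reversed is '321A', and the wildcard moves to the end.
select * from t1 where reverse(c1) like '321A%';

-- Running explain on the rewritten form shows whether the
-- reverse(c1) index is picked up.
explain select * from t1 where reverse(c1) like '321A%';
```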

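(2) Creating an Index with a Specified Collation

With the default index operator class, a B-tree index supports fuzzy queries only if collate "C" is specified both when creating the index and in the query.

Note: the index can be used only when the index and the query condition use the same collation.

The index is created with: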

CREATE INDEX ON t1 (c1 collate "C");

The where condition of the query needs the same collate setting:

test=# explain select * from t1 where c1 like 'A123%' collate "C";
                                       QUERY PLAN
----------------------------------------------------------------------------------------
  id |                operation                | E-rows | E-memory | E-width | E-costs
 ----+-----------------------------------------+--------+----------+---------+---------
   1 | ->  Streaming (type: GATHER)            |      1 |          |       8 | 14.27
   2 |    ->  Index Scan using t1_c1_idx on t1 |      1 | 1MB      |       8 | 8.27
           Predicate Information (identified by plan id)
 ------------------------------------------------------------------
   2 --Index Scan using t1_c1_idx on t1
         Index Cond: ((c1 >= 'A123'::text) AND (c1 < 'A124'::text))
         Filter: (c1 ~~ 'A123%'::text COLLATE "C")

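(3) GIN Inverted Index

GIN (Generalized Inverted Index) is a general-purpose inverted index. It is designed for cases where the indexed items are composite values, and queries need to find specific element values that appear inside those composite values. For example, a document is made up of many words, and a query needs to find the documents that contain a particular word.

The following example shows how to use a GIN index: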

create table gin_test_data(id int, chepai varchar(10), shenfenzheng varchar(20), duanxin text) distribute by hash (id);
create index chepai_idx on gin_test_data using gin(to_tsvector('ngram', chepai)) with (fastupdate=on);

The statements above create a GIN inverted index on the license plate column (chepai).
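
To get a feel for what the index expression stores, the tokens produced by the 'ngram' configuration can be inspected directly. This is a minimal sketch; the plate value is made up for illustration, and no output is shown because it depends on the parser settings of the cluster.

```sql
-- Tokens that the 'ngram' configuration produces for a sample plate
-- (the plate value is hypothetical).
select to_tsvector('ngram', '湘F12345');

-- How a search term is turned into a tsquery with the same configuration.
select to_tsquery('ngram', '湘F');
```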

To run a fuzzy query on the license plate, use a statement like the following:

select count(*) from gin_test_data where to_tsvector('ngram', chepai) @@ to_tsquery('ngram', '湘F');

The query plan for this statement is:

test=# explain select count(*) from gin_test_data where to_tsvector('ngram', chepai) @@ to_tsquery('ngram', '湘F');
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
  id |                   operation                    | E-rows | E-memory | E-width | E-costs
 ----+------------------------------------------------+--------+----------+---------+---------
   1 | ->  Aggregate                                  |      1 |          |       8 | 18.03
   2 |    ->  Streaming (type: GATHER)                |      1 |          |       8 | 18.03
   3 |       ->  Aggregate                            |      1 | 1MB      |       8 | 12.03
   4 |          ->  Bitmap Heap Scan on gin_test_data |      1 | 1MB      |       0 | 12.02
   5 |             ->  Bitmap Index Scan              |      1 | 1MB      |       0 | 8.00
                         Predicate Information (identified by plan id)
 ----------------------------------------------------------------------------------------------
   4 --Bitmap Heap Scan on gin_test_data
         Recheck Cond: (to_tsvector('ngram'::regconfig, (chepai)::text) @@ '''湘f'''::tsquery)
   5 --Bitmap Index Scan
         Index Cond: (to_tsvector('ngram'::regconfig, (chepai)::text) @@ '''湘f'''::tsquery)

The query uses the inverted index (Bitmap Index Scan), so it performs reasonably well.
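
The same pattern could be applied to the other text columns of gin_test_data. The sketch below does so for the text message column (duanxin); the index name and the search keyword are hypothetical and only illustrate the shape of the statements.

```sql
-- Hypothetical second index, on the text message column of the same table.
create index duanxin_idx on gin_test_data
    using gin(to_tsvector('ngram', duanxin)) with (fastupdate=on);

-- The WHERE expression has to match the indexed expression,
-- including the 'ngram' configuration name, for the index to be considered.
select id, duanxin
from gin_test_data
where to_tsvector('ngram', duanxin) @@ to_tsquery('ngram', '快遞');
```

As in the license plate example, running explain on the query is the way to confirm that a Bitmap Index Scan is actually chosen.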
