Postgres全文搜索功能_ZenDei技術網路在線

當構建一個Web應用時，經常被要求加上搜索功能。其實有時候我們也不知道我要搜索個啥，反正就是要有這個功能。搜索確實很重要的特性，這也是為什麼像Elasticsearch和Solr這樣基於Lucene的資料庫越來越流行。這兩個是好用，但是在構建Web應用時，有時候感覺像是殺雞用牛刀。所以我們需要選擇輕 ...

當構建一個Web應用時，經常被要求加上搜索功能。其實有時候我們也不知道我要搜索個啥，反正就是要有這個功能。搜索確實很重要的特性，這也是為什麼像Elasticsearch和Solr這樣基於Lucene的資料庫越來越流行。這兩個是好用，但是在構建Web應用時，有時候感覺像是殺雞用牛刀。所以我們需要選擇輕量好用的東西來搜索。

如何定義足夠好用？ 足夠好用的搜索引擎需要有以下幾點特征：

Stemming 關鍵詞，詞乾
Ranking/Boost 排名和權重分配
多語言支持
錯誤糾正/模糊查詢
口音支持類似我們的一二三四聲

非常幸運，PostgreSql支持以上所有特征。
這篇文章旨在幫助以下的人群：

用了PG資料庫，不想再加一個搜索引擎
用了其他資料庫，想有一個更好的全文檢索體驗

在這片文章，我們將基於以下表和數據逐步的闡述PG中的全文搜索功能。

CREATE TABLE author(
   id SERIAL PRIMARY KEY,
   name TEXT NOT NULL
);

CREATE TABLE post(
   id SERIAL PRIMARY KEY,
   title TEXT NOT NULL,
   content TEXT NOT NULL,
   author_id INT NOT NULL references author(id) 
);

CREATE TABLE tag(
   id SERIAL PRIMARY KEY,
   name TEXT NOT NULL 
);

CREATE TABLE posts_tags(
   post_id INT NOT NULL references post(id),
   tag_id INT NOT NULL references tag(id)
 );

INSERT INTO author (id, name) 
VALUES (1, 'Pete Graham'), 
       (2, 'Rachid Belaid'), 
       (3, 'Robert Berry');

INSERT INTO tag (id, name) 
VALUES (1, 'scifi'), 
       (2, 'politics'), 
       (3, 'science');

INSERT INTO post (id, title, content, author_id) 
VALUES (1, 'Endangered species', 
        'Pandas are an endangered species', 1 ), 
       (2, 'Freedom of Speech', 
        'Freedom of speech is a necessary right', 2), 
       (3, 'Star Wars vs Star Trek', 
        'Few words from a big fan', 3);


INSERT INTO posts_tags (post_id, tag_id) 
VALUES (1, 3), 
       (2, 2), 
       (3, 1);

以上是一個比較典型的Blog類型的資料庫。有post, post有title和content。post通過外鍵與author建立關係。post也有自己的多個tag。

什麼是Full-Text Search全文檢索

首先讓我們看看定義

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. The full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
-- Wikipedia

這段定義指出了一個重要的概念-document。當你執行一個查詢的時候，你實際上是為了尋找一些具有實際含義的實體。那些就是documents。PostgreSQL文檔解釋的有點耐人尋味

A document is the unit of searching in a full-text search system; for example, a magazine article or email message.
-- Postgres documentation

document 能夠跨越多個表，它代表的是一個邏輯上的獨立個體。

建立我們的document

在上一節我們介紹了document的概念，document和我們的表結構沒關係，它只和數據有關係。以我們的資料庫為例，一個document可以有以下幾個field組成：

post.title
post.content
該post的author.name
所有與該post相關的tag.name
通過以下查詢，我們可以建立一個document的二維表數據

SELECT post.title || ' ' || 
       post.content || ' ' ||
       author.name || ' ' ||
       coalesce((string_agg(tag.name, ' ')), '') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

               document
 --------------------------------------------------
Endangered species Pandas are an endangered species Pete Graham politics
Freedom of Speech Freedom of speech is a necessary right Rachid Belaid politics
Star Wars vs Star Trek Few words from a big fan Robert Berry politics

在查詢語句中，我們通過post和author進行group，通過string_agg對tag.name進行聚合。同時我們用了coalesce來防止聚合之後tag出現null。
此時，我們的document是一個簡單的長字元串，當然對我們起不到什麼作用。我們需要通過to_tsvector()對這個長字元串操作一頓。

SELECT to_tsvector(post.title) || 
       to_tsvector(post.content) ||
       to_tsvector(author.name) ||
       to_tsvector(coalesce((string_agg(tag.name, ' ')), '')) as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;
           
               document
 --------------------------------------------------
 'endang':1,6 'graham':9 'panda':3 'pete':8 'polit':10 'speci':2,7
 'belaid':12 'freedom':1,4  'necessari':9 'polit':13 'rachid':11 'right':10 'speech':3,6
 'berri':13 'big':10 'fan':11 'polit':14 'robert':12 'star':1,4 'trek':5 'vs':3 'war':2 'word':7
(3 rows)

這個查詢以tsvector的形式回傳我們的document，這種格式特別適合來做full-text search。讓我們在試一下將一個簡單的字元串轉化成tsvector類型

SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

query will return the following result:

                             to_tsvector
----------------------------------------------------------------------
'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

見證奇跡的時刻到啦。首先有一些單詞是來自於原句子的，但是有一些單詞變化了(try變成了tri)。同時所有單詞後面都帶了數字，為啥子?

一個tsvector是由一組不同的詞元組成。詞元是分詞之後的變種，就是被操作過了。操作的專業名詞叫做normalization。這波操作主包含大小寫轉化，尾碼移除等等。搜索意味著要面對多重選項，減少一些選項便於引擎去搜搜。單詞後面帶的數字表示詞元原來在的位置。比如man就是在6和15。您可以自個兒數數對不對。

預設情況下，Postgres在to_tsvector中用的是english的配置。它會直接忽略非英文的單詞。

Querying

現在，我們知道如何構建一個文檔。但是我們的目的是為了搜索。我們用@@來對tsvector的數據進行搜索。官方文檔走一波。讓我們看看一些例子：

> select to_tsvector('If you can dream it, you can do it') @@ 'dream';
 ?column?
----------
 f
(1 row)

> select to_tsvector('It''s kind of fun to do the impossible') @@ 'impossible';

 ?column?
----------
 t
(1 row)

上面主要是簡單的查一下是否存在這個關鍵詞。接下去的例子可以展示強制類型轉化和to_tsquery這個函數的區別。

SELECT 'impossible'::tsquery, to_tsquery('impossible');
   tsquery    | to_tsquery
--------------+------------
 'impossible' | 'imposs'
(1 row)

to_tsquery函數用於寫一些簡單的搜索語句。支持布爾操作符&(AND), |(OR), !(NOT)。在to_tsquery也可以用進行表達式優先順序的調整。

> SELECT to_tsvector('If the facts don't fit the theory, change the facts') @@ to_tsquery('! fact');

 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact');

 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory');

 ?column?
----------
 t
(1 row)

同時可以用:*進行startwith搜索。

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('theo:*');

 ?column?
----------
 t
(1 row)

現在我們知道如何做一個full-text search。我們回到我們的資料庫嘗試對我們文檔進行查詢。

SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             to_tsvector(post.title) || 
             to_tsvector(post.content) ||
             to_tsvector(author.name) ||
             to_tsvector(coalesce(string_agg(tag.name, ' '))) as document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('Endangered & Species');

 pid |      p_title
-----+--------------------
   1 | Endangered species

語言支持

Posgres內置了一些語言: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.（沒看到中文，這段就不說了）

Accented Character

音調的不同導致意思也不同，（沒有內置中文，這段就不說了）

Ranking

構建一個搜索引擎，您會希望你的查詢結果是根據相關度來排序的。與documents的排名相關的指標在下麵這段引用中解釋的很清楚。

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.
-- PostgreSQL documentation

在PostgreSQL中提供了兩個函數來進行相關度調整。分別是ts_rank() 和setweight()

函數setweight允許給予tsvector一個權重，權重的值可以是A,B,C,D。

SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             setweight(to_tsvector(post.title), 'A') ||
             setweight(to_tsvector(post.content), 'B') ||
             setweight(to_tsvector('simple', author.name), 'C') ||
             setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '))), 'B') as document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;

在上面這個查詢中，我們給予不同的tsvector不同的權重，A權重的post.title比B權重的post.content更加重要。依此類推。這意味著如果我們查一個詞Alice。Adocument在標題中有Alice，Bdocument在內容中有Alice。那A文檔會在前。
ts_rank函數會回傳tsquery與tsvector的相關度，用一個小數來表示。

SELECT ts_rank(to_tsvector('This is an example of document'), 
               to_tsquery('example | document')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)


SELECT ts_rank(to_tsvector('This is an example of document'), 
               to_tsquery('example ')) as relevancy;
 relevancy
-----------
 0.0607927
(1 row)


SELECT ts_rank(to_tsvector('This is an example of document'), 
               to_tsquery('example | unkown')) as relevancy;
 relevancy
-----------
 0.0303964
(1 row)


SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example & document')) as relevancy;
 relevancy
-----------
 0.0985009
(1 row)


SELECT ts_rank(to_tsvector('This is an example of document'), 
               to_tsquery('example & unknown')) as relevancy;
 relevancy
-----------
 1e-20
(1 row)

但是，相關度這個概念實際上有點模糊且與實際APP需求緊密連接的。不同的APP需要不同的排名方式。你可以自己加上一些參數上去。比如你加一個根據作者年齡的排序，你可以ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) * author.age DESC。最後怎麼算還是由你自己決定。

(Optimization and Indexing)優化和索引

優化搜索非常簡單明瞭。因為PostgreSQL支持基於Index的函數。所以你可以創建一個GIN index作用於tsvector函數。

CREATE INDEX idx_fts_post ON post 
USING gin((setweight(to_tsvector(language::regconfig, title),'A') || 
       setweight(to_tsvector(language::regconfig, content), 'B')));

-- If this throws an IMMUTABLE error then you can use this workaround

CREATE OR REPLACE FUNCTION gin_fts_fct(title text, content text, language text) 
  RETURNS tsvector
AS
$BODY$
    SELECT setweight(to_tsvector($3::regconfig, $1), 'A') || setweight(to_tsvector($3::regconfig, $1), 'B');
$BODY$
LANGUAGE sql
IMMUTABLE;

CREATE INDEX idx_fts_post ON post  USING gin(gin_fts_fct(title, content, language));

選擇GIN還是GiST索引？這要根據你的讀寫量以及數據量來決定

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000 while GIN indexes will handle 100,000+ lexemes better but are slower to update.
-- Postgres doc : Chap 12 Full Text Search

在這個例子中，我們將採用GIN。但是你可以根據你的情況決定使用什麼數據。在document中，我們還有一個問題。document是跨表的，且各表欄位擁有不同權重。為了得到更好的性能，我們通過triggers或者materialized view的方式進行數據重組。不過您不需要總是重組數據。在一些情況下，您可以加一些基於index的函數。還可以通過tsvector_update_trigger(...)或者tsvector_update_trigger_column(...)來重組數據。查看文檔來獲取更多細節。另外如果可以接受短暫延遲的話。用Materialized View是一個不錯的選擇。

CREATE MATERIALIZED VIEW search_index AS 
SELECT post.id,
       post.title,
       setweight(to_tsvector(post.language::regconfig, post.title), 'A') || 
       setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
       setweight(to_tsvector('simple', author.name), 'C') ||
       setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '))), 'A') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id

然後可以通過REFRESH MATERIALIZED VIEW search_index;的方式來重建索引。
首先我們建立索引

CREATE INDEX idx_fts_search ON search_index USING gin(document);

然後查詢也會變得更加簡單

SELECT id as post_id, title
FROM search_index
WHERE document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;

Mispelling

英文沒啥用，中文的分詞不知道實現了pg_trgm沒有。。

總結

Posgres全文搜索功能還是相當棒和快的。有了它，你就不用裝其他的搜索引擎了。Posgres像不像一顆幸福的子彈。。。也許不是，如果您的業務圍著搜索轉啊轉的話。
有一些功能我沒講，但是您基本上也用不到。以我個人而言，我希望Posgres的全文搜索還有一些功能能夠加進來。

更多的內置語言
和Lucene搞基
更好的排名和權重分配解決方案
模糊查詢再叼一點
總的來說，Posgres再全文搜索上肯定是沒有ElasticSearch和SOLR那麼先進。。。