Elasticsearch Analyzers

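Every analyzer, whether built-in or custom, is assembled from three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.

Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in a Servlet container; picture a filter chain.)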
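
As a rough illustration of what a character filter does, the Python sketch below rewrites Arabic-Indic digits as ASCII digits before any tokenization takes place. The function name `digit_char_filter` is made up for this example, and the code only imitates the idea; it is not how Elasticsearch implements its character filters.

```python
# Toy character filter: rewrites characters in the raw text stream.
# Illustrative only; `digit_char_filter` is a hypothetical name.

# U+0660..U+0669 are the Arabic-Indic digits zero through nine.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# str.maketrans builds a one-to-one character mapping table.
_DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, ASCII_DIGITS)


def digit_char_filter(text: str) -> str:
    """Replace each Arabic-Indic digit with its ASCII equivalent."""
    return text.translate(_DIGIT_TABLE)
```

The input and the output are both plain character streams, which is why several such filters can be chained one after another.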

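Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, a whitespace tokenizer splits the text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer splits the text into individual tokens, where each token is a single word. In other words, a piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, as well as the start and end character offsets of the original word that the term represents. (PS: the output of tokenization is an array of terms.)

An analyzer must have exactly one tokenizer.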
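
The following Python sketch mimics a whitespace tokenizer that also records the position and character offsets of every token, as described above. The function name `whitespace_tokenize` and the dictionary layout are assumptions chosen to resemble the analyze API output; this is a toy model, not Elasticsearch code.

```python
import re
from typing import Dict, List


def whitespace_tokenize(text: str) -> List[Dict[str, object]]:
    """Split on whitespace, keeping each token's position and offsets."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for tok in whitespace_tokenize("Quick brown fox!"):
        print(tok)
```

Unlike a bare `split`, the tokenizer keeps enough information to map each token back to its place in the original text.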

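Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) such as "the", and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.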
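
To show how token filters are chained, here is a small Python sketch with a lowercase filter followed by a stop filter. The names `lowercase_filter`, `stop_filter`, and `apply_filters` are hypothetical, and the token dictionaries are an assumed shape; the point is only that filters change or drop tokens while leaving positions and offsets untouched.

```python
from typing import Dict, Iterable, Iterator, List, Set

Token = Dict[str, object]


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Change tokens: lowercase the text, leave position/offsets alone."""
    for tok in tokens:
        yield {**tok, "token": str(tok["token"]).lower()}


def stop_filter(tokens: Iterable[Token], stopwords: Set[str]) -> Iterator[Token]:
    """Remove tokens: drop stop words; survivors keep their positions."""
    for tok in tokens:
        if tok["token"] not in stopwords:
            yield tok


def apply_filters(tokens: Iterable[Token], stopwords: Set[str]) -> List[Token]:
    # Filters run in order: lowercase first, so "The" matches "the".
    return list(stop_filter(lowercase_filter(tokens), stopwords))
```

Order matters here: if the stop filter ran before the lowercase filter, the capitalized "The" would not match the stop word "the" and would survive.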

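Summary and review

An analyzer is a package made up of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it takes a character stream as input and produces a character stream as output.

A tokenizer splits text: it takes a character stream as input and produces a token stream as output (the text is split into individual words, and these words are called tokens).

A token filter filters tokens: it takes a token stream as input and produces a token stream as output.

So the job of the whole analyzer is to break text into individual words: text ----> characters ----> tokens

This works much like a chain of interceptors, with each stage handing its output to the next.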

1. Testing Analyzers

The analyze API is a tool that lets us see how text is analyzed. (PS: it plays a role similar to a query execution plan.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and character offsets are recorded for each term.
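
One way to read the offsets is as slice boundaries into the original text. The Python sketch below checks that against the whitespace analyzer response shown above; the helper name `offsets_match` is made up for this example. The check only holds when no token filter has altered the token text, and it relies on the text being plain ASCII so that Python string indexes line up with the reported offsets.

```python
import json

TEXT = "The quick brown fox."

# The _analyze response shown above, reduced to the fields used here.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "The",   "start_offset": 0,  "end_offset": 3,  "position": 0},
  {"token": "quick", "start_offset": 4,  "end_offset": 9,  "position": 1},
  {"token": "brown", "start_offset": 10, "end_offset": 15, "position": 2},
  {"token": "fox.",  "start_offset": 16, "end_offset": 20, "position": 3}
]}
""")


def offsets_match(text: str, response: dict) -> bool:
    """True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )


if __name__ == "__main__":
    print(offsets_match(TEXT, RESPONSE))
```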

2. Analyzer

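2.1. Configuring Built-in Analyzers

Built-in analyzers can be used directly without any configuration. Their default settings can nevertheless be changed. For example, the standard analyzer can be configured to support a list of stop words: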

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", 
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" 
            }
          }
        }
      }
    }
  }
}
'

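In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer and the my_text.english sub-field uses std_english. The two requests below therefore produce different results: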

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text", 
  "text": "The old brown cow"
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english", 
  "text": "The old brown cow"
}
'

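The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second request uses std_english, so the result is: [ old, brown, cow ]

2.2. Standard Analyzer (default)

The standard analyzer is the default and is used when no analyzer is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example: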

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

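For the text above, the following terms are produced:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a longer token is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.2.2. Example Configuration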

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

This produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
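
The "jumpe, d" pair in this output comes from max_token_length: a token longer than the limit is cut at max_token_length intervals. A minimal Python sketch of that splitting rule follows; the function name `split_by_max_length` is hypothetical and the code models only this one rule, not the standard tokenizer as a whole.

```python
from typing import List


def split_by_max_length(token: str, max_token_length: int = 255) -> List[str]:
    """Cut a token into consecutive pieces of at most max_token_length chars."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be positive")
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


if __name__ == "__main__":
    print(split_by_max_length("jumped", 5))
```

With a limit of 5, "quick", "brown", "foxes", and "dog's" are exactly five characters long and pass through unchanged, while the six-character "jumped" is split in two.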

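2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer and use that as a starting point for further customization: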

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and all terms are lowercased. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
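
The behaviour of the simple analyzer can be approximated in a few lines of Python: collect every run of letters and lowercase it. The function name `simple_analyze` is made up, and Python's notion of a letter may differ from Lucene's for unusual Unicode characters, so treat this as a sketch of the rule rather than a faithful reimplementation.

```python
import re
from typing import List

# [^\W\d_] = word characters minus digits and underscore, i.e. letters.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> List[str]:
    """Approximate the simple analyzer: letter runs, lowercased."""
    return [m.group().lower() for m in _LETTER_RUN.finditer(text)]


if __name__ == "__main__":
    print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The digit "2" disappears because it is not a letter, and "dog's" becomes "dog" and "s" because the apostrophe acts as a break.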

2.3.1. Customization

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

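The stop analyzer is much like the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the _english_ stop words.

(PS: suppose we have the sentence "this is an apple", and suppose "this" and "is" are both stop words. The simple analyzer would output [ this, is, an, apple ], whereas the stop analyzer would output [ an, apple ]. That is the difference between the two: the stop analyzer does not emit stop words, that is, it does not treat a stop word as a term.)

(PS: stop words are common words that are dropped from the token stream; they are not delimiters and do not decide where the text is split.)

2.5.1. Example Output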

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

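2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (for example, _english_) or an array containing a list of stop words. Defaults to _english_.
stopwords_path: the path to a file containing stop words. This path is relative to the Elasticsearch config directory.

2.5.3. Example Configuration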

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With the configuration above, this request produces the following output:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
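
The same result can be reproduced with a small Python emulation: split the text into lowercase runs of letters, as the simple analyzer does, then drop anything in the stop word list. The function name `stop_analyze` is hypothetical, and the stop words are passed in explicitly so that no assumption is made about the contents of the built-in _english_ list.

```python
import re
from typing import Iterable, List


def stop_analyze(text: str, stopwords: Iterable[str]) -> List[str]:
    """Approximate the stop analyzer: lowercase letter runs minus stop words."""
    stop_set = {word.lower() for word in stopwords}
    # Same splitting rule as the simple analyzer: break on any non-letter.
    terms = (m.group().lower() for m in re.finditer(r"[^\W\d_]+", text))
    return [term for term in terms if term not in stop_set]


if __name__ == "__main__":
    sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    print(stop_analyze(sentence, ["the", "over"]))
```

Both occurrences of "the" are removed, including the capitalized one, because lowercasing happens before the stop words are compared.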

2.6. Pattern Analyzer

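The pattern analyzer uses a Java regular expression to split text into terms. The regular expression should match the token separators, not the tokens themselves; the default is \W+ (all non-word characters).

2.6.1. Example Output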

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because the text is split on non-word characters by default, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

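2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, for example CASE_INSENSITIVE or COMMENTS. Multiple flags are separated by |, as in "CASE_INSENSITIVE|COMMENTS".
lowercase: whether terms should be lowercased. Defaults to true.
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.6.3. Example Configuration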

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits text on non-word characters or underscores, and lowercases every term.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With the configuration above, this example produces:

[ john, smith, foo, bar, com ]
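
Because the pattern marks separators rather than tokens, the analyzer behaves much like a regular-expression split followed by optional lowercasing. The Python sketch below imitates that; `pattern_analyze` is a made-up name, Python's `re` syntax is close to but not identical with Java's (for instance, `\W` is Unicode-aware in Python by default), and the sample address `John_Smith@foo-bar.com` is assumed here because it is the input used for this example in the Elasticsearch reference documentation.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Approximate the pattern analyzer: the regex marks separators, not tokens."""
    # re.split can yield empty strings next to adjacent separators; drop them.
    terms = [term for term in re.split(pattern, text) if term]
    return [term.lower() for term in terms] if lowercase else terms


if __name__ == "__main__":
    print(pattern_analyze("John_Smith@foo-bar.com", r"\W|_"))
```

Adding `_` to the pattern is what separates "john" from "smith"; with the default `\W+` the underscore counts as a word character and "john_smith" would stay in one piece.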

2.7. Language Analyzers

Language analyzers support text analysis for specific languages. The built-in (predefined) language analyzers are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzers

As mentioned earlier, an analyzer is made up of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example Configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
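
To make the order of the stages concrete, here is a small Python emulation of my_custom_analyzer. Every stage is a rough stand-in and an assumption of this sketch: tags are stripped with a simple regular expression, the standard tokenizer is replaced by `\w+`, and ASCII folding is approximated with Unicode decomposition (characters that have no decomposition are simply dropped, which the real asciifolding filter handles better).

```python
import html
import re
import unicodedata
from typing import List


def html_strip(text: str) -> str:
    """Character filter: drop tags and decode entities (rough approximation)."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def ascii_fold(token: str) -> str:
    """Token filter: strip diacritics by decomposing and dropping non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def my_custom_analyze(text: str) -> List[str]:
    text = html_strip(text)                  # char_filter: html_strip
    tokens = re.findall(r"\w+", text)        # tokenizer: stand-in for standard
    tokens = [t.lower() for t in tokens]     # filter: lowercase
    return [ascii_fold(t) for t in tokens]   # filter: asciifolding


if __name__ == "__main__":
    print(my_custom_analyze("Is this <b>d\u00e9j\u00e0 vu</b>?"))
```

The character filter runs on the raw text, the tokenizer runs once, and the token filters then run in the order in which they are listed in the configuration.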

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

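4. Chinese Analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

The plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, neither of which requires any configuration.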

# Install
bin/elasticsearch-plugin install analysis-smartcn
# Remove
bin/elasticsearch-plugin remove analysis-smartcn

Let's test it with the sentence "今天天氣真好".

With the smartcn analyzer, the result is:

[ 今天, 天氣, 真, 好 ]

With the standard analyzer, the result is:

[ 今, 天, 天, 氣, 真, 好 ]
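
A comparison like this can be scripted against the analyze API. The Python sketch below uses only the standard library; the helper names `build_analyze_request` and `analyze`, and the base URL `http://localhost:9200`, are assumptions made for the example. Building the request is kept separate from sending it so that the payload can be inspected without a running node.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(base_url: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url: str, analyzer: str, text: str) -> List[str]:
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [tok["token"] for tok in payload["tokens"]]


# Usage (needs a node at the assumed URL with analysis-smartcn installed):
#   for name in ("smartcn", "standard"):
#       print(name, analyze("http://localhost:9200", name, "今天天氣真好"))
```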

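4.2. IK Analyzer

Download the IK release that matches your Elasticsearch version; I downloaded 6.5.3 here.

Then create an ik directory under the Elasticsearch plugins directory and extract the downloaded file into it.

Finally, restart Elasticsearch.

Next, let's test it with the same sentence as before.

The output is: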

{
    "tokens": [
        {
            "token": "今天天氣",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天氣",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}