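Big data, Elasticsearch, real-time search, search_as_you_type, Completion Suggester, query optimization, prefix matching, infix matching ...
Overview
A common requirement in application development is to show matching results in real time while the user is still typing in a search box. Two common ways to implement this in Elasticsearch are Completion Suggester and search_as_you_type. How do the two differ? Let's take a look.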
Environment:
Data volume: 90 million+ documents
Elasticsearch version: 7.10.1
Script execution tool: Kibana
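Differences between Completion Suggester and search_as_you_type
1. Completion Suggester does prefix matching only. Its data structure is held in memory, which makes it extremely fast, at the cost of high memory usage.
2. search_as_you_type supports both prefix and infix matching. It can also be very fast, but only if you choose the query type carefully.
3. The APIs differ: Completion Suggester is queried through a suggest clause, whereas search_as_you_type is queried the same way as a regular field.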
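To make the prefix/infix distinction concrete, here is a minimal plain-Java sketch. It is not how Elasticsearch implements either feature, and the class MatchModes and its method names are hypothetical, chosen only for illustration. A sorted in-memory set stands in for the suggester's in-memory structure, and a naive substring scan shows what an infix match has to find (search_as_you_type avoids such a scan by indexing extra subfields).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; this is not Elasticsearch code.
final class MatchModes {

    // Prefix lookup over a sorted set: jump to the first entry >= prefix,
    // then walk forward while entries still start with the prefix.
    static List<String> prefixMatch(TreeSet<String> titles, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String title : titles.tailSet(prefix, true)) {
            if (!title.startsWith(prefix)) {
                break; // entries sharing a prefix are contiguous, so we can stop here
            }
            hits.add(title);
        }
        return hits;
    }

    // Infix lookup: the keyword may appear anywhere, so every title is checked.
    static List<String> infixMatch(TreeSet<String> titles, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (title.contains(keyword)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> titles = new TreeSet<>(
                List.of("憤怒的小鳥", "最後一隻渡渡鳥", "今天不加班啊", "憤怒的青年"));
        System.out.println(prefixMatch(titles, "憤怒"));
        System.out.println(infixMatch(titles, "的小"));
    }
}
```

The prefix lookup touches only the entries that share the prefix, which hints at why a dedicated prefix structure can be so fast. The infix lookup cannot take that shortcut, which is why the choice of query type matters more for search_as_you_type.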
An example
How to implement prefix matching
Use Completion Suggester, as in the following example:
- Create the index
PUT /es_demo
{
  "mappings": {
    "properties": {
      "title_comp": {
        "type": "completion",
        "analyzer": "standard"
      }
    }
  }
}
- Load sample data
POST _bulk
{"index":{"_index":"es_demo","_id":"1"}}
{"title_comp": "憤怒的小鳥"}
{"index":{"_index":"es_demo","_id":"2"}}
{"title_comp": "最後一隻渡渡鳥"}
{"index":{"_index":"es_demo","_id":"3"}}
{"title_comp": "今天不加班啊"}
{"index":{"_index":"es_demo","_id":"4"}}
{"title_comp": "憤怒的青年"}
{"index":{"_index":"es_demo","_id":"5"}}
{"title_comp": "最後一隻996程式猿"}
{"index":{"_index":"es_demo","_id":"6"}}
{"title_comp": "今日無事,勾欄聽曲"}
- Query DSL
Query by prefix to find strings that start with “憤怒”
GET /es_demo/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "憤怒",
      "completion": {
        "field": "title_comp"
      }
    }
  }
}
- Query code demo
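import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;
// JUnit 5 is assumed here; use org.junit.Test if the project is on JUnit 4
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class SuggestTest {
    @Autowired
    private RestHighLevelClient restHighLevelClient;

    @Test
    public void testComp() {
        List<Map<String, Object>> list = suggestComplete("憤怒");
        list.forEach(m -> System.out.println("[" + m.get("title_comp") + "]"));
    }

    public List<Map<String, Object>> suggestComplete(String keyword) {
        CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion("title_comp");
        completionSuggestionBuilder.size(5)
                // skip duplicate suggestions
                .skipDuplicates(true);
        SuggestBuilder suggestBuilder = new SuggestBuilder();
        suggestBuilder.addSuggestion("suggest_title", completionSuggestionBuilder)
                .setGlobalText(keyword);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.suggest(suggestBuilder);
        SearchRequest searchRequest = new SearchRequest("es_demo").source(searchSourceBuilder);
        try {
            SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("suggest_title");
            List<Map<String, Object>> suggestList = new LinkedList<>();
            // each option carries the matching document, so read the title from its _source
            for (CompletionSuggestion.Entry.Option option : completionSuggestion.getOptions()) {
                Map<String, Object> map = new HashMap<>();
                map.put("title_comp", option.getHit().getSourceAsMap().get("title_comp"));
                suggestList.add(map);
            }
            return suggestList;
        } catch (IOException e) {
            // keep the original exception as the cause so the stack trace is not lost
            throw new RuntimeException("Elasticsearch query failed", e);
        }
    }
}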
Query result:
[憤怒的小鳥]
[憤怒的青年]
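How to implement infix matching
Use search_as_you_type. The example below defines two subfields, one with the hanlp_index analyzer and one with the standard analyzer. Note that hanlp_index is not a built-in analyzer; it assumes a HanLP analysis plugin is installed.
- Create the index
PUT /es_search_as_you_type
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "han": {
            "type": "search_as_you_type",
            "analyzer": "hanlp_index"
          },
          "stan": {
            "type": "search_as_you_type",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
- Load sample data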
POST _bulk
{"index":{"_index":"es_search_as_you_type","_id":"1"}}
{"title": "憤怒的小鳥"}
{"index":{"_index":"es_search_as_you_type","_id":"2"}}
{"title": "最後一隻渡渡鳥"}
{"index":{"_index":"es_search_as_you_type","_id":"3"}}
{"title": "今天不加班啊"}
{"index":{"_index":"es_search_as_you_type","_id":"4"}}
{"title": "憤怒的青年"}
{"index":{"_index":"es_search_as_you_type","_id":"5"}}
{"title": "最後一隻996程式猿"}
{"index":{"_index":"es_search_as_you_type","_id":"6"}}
{"title": "今日無事,勾欄聽曲"}
- Query DSL
GET /es_search_as_you_type/_search
{
  "query": {
    "match": {
      "title.stan": {
        "query": "的小",
        "operator": "and"
      }
    }
  }
}
- Query code demo
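import static org.elasticsearch.index.query.QueryBuilders.matchQuery;

import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.MatchQueryBuilder;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
// JUnit 5 is assumed here; use org.junit.Test if the project is on JUnit 4
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class SuggestTest {
    @Autowired
    private RestHighLevelClient restHighLevelClient;

    @Test
    public void testSearchAsYouType() {
        List<Map<String, Object>> list = suggestSearchAsYouType("的小");
        list.forEach(m -> System.out.println("[" + m.get("title") + "]"));
    }

    public List<Map<String, Object>> suggestSearchAsYouType(String keyword) {
        // this queries the _2gram subfield of the search_as_you_type field; adjust to your own needs
        MatchQueryBuilder matchQueryBuilder = matchQuery("title.stan._2gram", keyword).operator(Operator.AND);
        // fields to return
        String[] includeFields = new String[]{"title"};
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .query(matchQueryBuilder).size(5)
                .fetchSource(includeFields, null)
                .trackTotalHits(false)
                .trackScores(true)
                .sort(SortBuilders.scoreSort());
        SearchRequest searchRequest = new SearchRequest("es_search_as_you_type").source(searchSourceBuilder);
        try {
            SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            SearchHits hits = response.getHits();
            List<Map<String, Object>> suggestList = new LinkedList<>();
            for (SearchHit hit : hits) {
                Map<String, Object> map = new HashMap<>();
                map.put("title", hit.getSourceAsMap().get("title").toString());
                suggestList.add(map);
            }
            return suggestList;
        } catch (IOException e) {
            // keep the original exception as the cause so the stack trace is not lost
            throw new RuntimeException("Elasticsearch query failed", e);
        }
    }
}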
Query result:
[憤怒的小鳥]
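Why does querying the 2gram subfield behave like an infix match? The following is a rough plain-Java simulation, not Elasticsearch code. It assumes that the analyzer emits one token per character (which is what the standard analyzer does for Chinese text, as the analyzer notes later in this post show) and that the 2gram subfield indexes each pair of adjacent tokens. TwoGramSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation for illustration; Elasticsearch builds such token pairs itself at index time.
final class TwoGramSketch {

    // One token per code point: a simplification of how Chinese text is tokenized.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // Join each pair of adjacent tokens, e.g. [a, b, c] -> {"a b", "b c"}.
    static Set<String> twoGrams(List<String> tokens) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            grams.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return grams;
    }

    // Mimics a match query with operator AND: every pair from the keyword must exist in the title.
    // In this sketch a one-character keyword produces no pairs and therefore matches nothing.
    static boolean matches(String title, String keyword) {
        Set<String> queryGrams = twoGrams(tokens(keyword));
        return !queryGrams.isEmpty() && twoGrams(tokens(title)).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        System.out.println(matches("憤怒的小鳥", "的小"));
        System.out.println(matches("憤怒的青年", "的小"));
    }
}
```

Because the pair "的 小" is stored as a single indexed term, the lookup is an ordinary term match even though the keyword sits in the middle of the title. The pair also preserves order and adjacency, which a plain AND over single characters would not.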
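Analyzer notes
How to view analysis results
Option 1
Specify the analyzer directly
GET _analyze
{
  "analyzer": "standard",
  "text": [
    "憤怒的小鳥"
  ]
}
Option 2
Use the analyzer configured for a specific field
POST es_search_as_you_type/_analyze
{
  "field": "title.stan",
  "text": [
    "憤怒的青年"
  ]
}
Differences between the hanlp_index and standard analyzers
standard analyzer
- Strips punctuation by default
- Splits Chinese into single characters; English is split into words wherever whitespace, punctuation, or a Chinese character occurs
Example:
GET _analyze
{
  "analyzer": "standard",
  "text": [
    "憤怒的小鳥"
  ]
}
Analysis result: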
{
  "tokens" : [
    {
      "token" : "憤",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "怒",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "小",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "鳥",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}
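The two rules listed for the standard analyzer can be approximated in a few lines of plain Java. This is only a rough sketch: the real analyzer follows Unicode text segmentation rules and handles many more cases. StandardLikeTokenizer is a hypothetical name, and the lowercasing step reflects the standard analyzer's default lowercase filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical approximation for illustration; not the actual Lucene tokenizer.
final class StandardLikeTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // each Chinese character becomes its own token
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                // letters and digits accumulate into one lowercased word
                run.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // whitespace and punctuation end the current word and are dropped
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Emit the pending letter/digit run, if any, as a single token.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("憤怒的小鳥"));
        System.out.println(tokenize("最後一隻996程式猿"));
        System.out.println(tokenize("今日無事,勾欄聽曲"));
    }
}
```

Running it on the sample titles shows the behaviour described above: single characters for Chinese, "996" kept together as one token, and the comma dropped.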
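hanlp_index analyzer
- Does not strip punctuation by default
- Segments the string using semantics and similar cues, so it produces whole words
Example:
GET _analyze
{
  "analyzer": "hanlp_index",
  "text": [
    "憤怒的小鳥"
  ]
}
Analysis result: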
{
  "tokens" : [
    {
      "token" : "憤怒",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "a",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "ude1",
      "position" : 1
    },
    {
      "token" : "小鳥",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "n",
      "position" : 2
    }
  ]
}
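Query performance in production
Queries generally finish within a few hundred milliseconds. PS: if a document has many fields, return only the few fields you actually need; otherwise transferring the data takes up a large share of the response time.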
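Summary
Of course, both Completion Suggester and search_as_you_type can be configured and queried in many other ways: for example, the Context Suggester for Completion Suggester, the 2gram and 3gram subfields of search_as_you_type, and query types such as match_bool_prefix, match_phrase, and match_phrase_prefix. Each combination produces different results, and this post only shows one approach that works reasonably well. How the other query types and settings are used, and how each of them works, is a topic for another time.
Official documentation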
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-as-you-type.html