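Elasticsearch Analysis (Tokenization)

1. What is an analyzer

An analyzer splits text into terms and normalizes them, which improves recall.

Given a sentence, an analyzer breaks it into individual words and normalizes each one (for example, converting tense and reducing plurals to singular).
Recall here means how many relevant results a search can find: normalization lets a query match more documents.

character filter: pre-processes the text before it is tokenized. The most common uses are stripping HTML tags (<span>hello</span> -> hello) and mapping characters such as & -> and (I&you -> I and you)

tokenizer: splits the text into tokens: hello you and me -> hello, you, and, me
token filter: transforms the tokens (lowercase, stop words, synonyms, stemming): dogs -> dog, liked -> like, Tom -> tom, a/the/an -> removed, mother -> mom, small -> little

The analyzer matters because every piece of text goes through all of these steps, and only the fully processed output is used to build the inverted index.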
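The three stages above can be sketched in plain Python. This is only a toy illustration, not how Elasticsearch implements analysis: the function names, the tiny stop-word set, and the lookup tables for stemming and synonyms are all assumptions made up for this example.

```python
import re

# Hypothetical lookup tables, just large enough for the examples in the text.
STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}
SYNONYMS = {"mother": "mom", "small": "little"}


def char_filter(text):
    """Stage 1: pre-process raw text (strip HTML tags, map '&' to 'and')."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Stage 2: split the filtered text into tokens on whitespace."""
    return text.split()


def token_filter(tokens):
    """Stage 3: lowercase, drop stop words, then apply stems and synonyms."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        result.append(token)
    return result


def analyze(text):
    """Run the three stages in order, as an analyzer would."""
    return token_filter(tokenizer(char_filter(text)))


if __name__ == "__main__":
    print(analyze("<span>Tom</span> liked the small dogs"))
    print(analyze("I&you"))
```

The order matters: the character filter sees the raw string, the tokenizer sees the filtered string, and the token filters only ever see individual tokens.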

2. Analysis pipeline

3. Built-in analyzers

Sample text: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (language-specific; for example english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
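To make the difference between two of these analyzers concrete, here is a rough Python approximation of the whitespace and simple analyzers. It is a sketch under assumptions: the real simple analyzer splits on any non-letter Unicode character, while this version only recognizes ASCII letters, and the function names are invented for the example.

```python
import re

SAMPLE = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


def simple_analyze(text):
    """Approximate the simple analyzer: split at non-letters, then lowercase.

    Assumption: ASCII letters only, which is enough for the sample sentence.
    """
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


if __name__ == "__main__":
    print(whitespace_analyze(SAMPLE))
    print(simple_analyze(SAMPLE))
```

The whitespace version keeps case and punctuation, so set_trans(5) survives as one token, while the simple version discards the digit and splits set_trans at the underscore.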

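4. pinyin analysis

GitHub repository of the pinyin analysis plugin: https://github.com/medcl/elasticsearch-analysis-pinyin

Note: the plugin can also convert Traditional Chinese characters to pinyin.

4.1. Download and installation

Download a release directly from GitHub. The plugin version must match your Elasticsearch version.

After unzipping, rename the extracted directory and move it into the plugins directory of your Elasticsearch installation: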

mv elasticsearch analysis-pinyin
mv analysis-pinyin /usr/local/elasticsearch/plugins/

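Restart Elasticsearch for the plugin to take effect.

4.2. keep_first_letter

Take the first pinyin letter of each Chinese character and join the letters into a single token.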

  • 中华人民共和国 ==> zhrmghg

4.3. keep_full_pinyin

Convert each Chinese character to its full pinyin and emit each syllable as a separate token.

  • 中华人民共和国 ==> [zhong, hua, ren, min, gong, he, guo]

4.4. keep_joined_full_pinyin

Convert each Chinese character to its full pinyin and join the syllables into a single token.

  • 中华人民共和国 ==> zhonghuarenmingongheguo

4.5. keep_separate_first_letter

Take the first pinyin letter of each Chinese character and emit each letter as a separate token.

  • 中华人民共和国 ==> [z, h, r, m, g, h, g]

4.6. limit_first_letter_length

With "limit_first_letter_length": 3 and "keep_first_letter": true, the joined first-letter token is truncated to 3 characters:

  • 中华人民共和国 ==> zhr
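The options in this section can be compared side by side with a small Python sketch. It does not call the plugin: the syllable table is hard-coded for the seven characters of the example, and the function names simply mirror the option names for readability. Treat it as an illustration of the expected output shapes only.

```python
# Hypothetical, hard-coded syllable table covering only the example phrase.
PINYIN = {
    "中": "zhong", "华": "hua", "人": "ren", "民": "min",
    "共": "gong", "和": "he", "国": "guo",
}


def syllables(text):
    """Look up the pinyin syllable of every character."""
    return [PINYIN[ch] for ch in text]


def keep_first_letter(text, limit_first_letter_length=None):
    """First letters joined into one token, optionally truncated."""
    joined = "".join(s[0] for s in syllables(text))
    if limit_first_letter_length is None:
        return joined
    return joined[:limit_first_letter_length]


def keep_full_pinyin(text):
    """One token per syllable."""
    return syllables(text)


def keep_joined_full_pinyin(text):
    """All syllables joined into one token."""
    return "".join(syllables(text))


def keep_separate_first_letter(text):
    """One token per first letter."""
    return [s[0] for s in syllables(text)]


if __name__ == "__main__":
    phrase = "中华人民共和国"
    print(keep_first_letter(phrase))
    print(keep_full_pinyin(phrase))
    print(keep_joined_full_pinyin(phrase))
    print(keep_separate_first_letter(phrase))
    print(keep_first_letter(phrase, limit_first_letter_length=3))
```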

5. The analyze API

5.1.

GET _analyze
{
  "text": ["中华人民共和国"],
  "analyzer": "ik_max_word"
}

GET _analyze
{
  "text": ["中华人民共和国"],
  "analyzer": "pinyin"
}

GET _analyze
{
  "text": ["中华人民ghg"],
  "analyzer": "pinyin"
}

6. Custom analyzers

Official reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis-custom-analyzer.html

6.1. A basic custom analyzer

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        // the analyzer name is arbitrary
        "my_analyzer": {
          "type": "custom",
          // character filters to apply
          "char_filter": ["html_strip"],
          // tokenizer to use
          "tokenizer": "standard",
          // token filters to apply, in order
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

GET /myindex/_analyze
{
  "text": ["<h1>zhejiang university</h1>"],
  "analyzer": "my_analyzer"
}
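As a rough mental model of what my_analyzer does, the chain html_strip -> standard -> lowercase -> asciifolding can be imitated in Python. This is an approximation under assumptions: the regular expressions stand in for the real html_strip filter and standard tokenizer, and Unicode decomposition stands in for asciifolding, which covers more characters than this sketch does.

```python
import re
import unicodedata


def html_strip(text):
    """Stand-in for the html_strip char filter: replace tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def standard_tokenize(text):
    """Very rough stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Approximate asciifolding by dropping combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def my_analyzer(text):
    """char_filter -> tokenizer -> lowercase -> asciifolding."""
    tokens = standard_tokenize(html_strip(text))
    return [ascii_fold(token.lower()) for token in tokens]


if __name__ == "__main__":
    print(my_analyzer("<h1>zhejiang university</h1>"))
    print(my_analyzer("<p>Université de Genève</p>"))
```

The second call shows why asciifolding is in the chain: accented input ends up as plain ASCII terms, so a query typed without accents can still match.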

6.2. Using custom char_filter/tokenizer/token_filter definitions

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "my_char_filter"],
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase", "asciifolding", "my_english_stop_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "filter": {
        "my_english_stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET myindex/_analyze
{
  "text": ["I'm a :) and :( person!"],
  "analyzer": "my_analyzer"
}
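The effect of the custom components can be traced by hand with a short Python sketch. Several things here are assumptions: the split pattern includes a space ([ ,.!?]) so that individual words become tokens, the stop-word set is a small subset rather than the full _english_ list, and html_strip and asciifolding are left out because they do not change this particular input.

```python
import re

# Mapping char filter rules from the example: emoticon -> replacement text.
MAPPINGS = {":)": "_happy_", ":(": "_sad_"}

# Assumption: a small subset of English stop words, not the full _english_ list.
STOP_WORDS = {"a", "an", "and", "the"}

# Assumption: the pattern tokenizer splits on spaces as well as punctuation.
SPLIT_PATTERN = r"[ ,.!?]"


def mapping_char_filter(text):
    """Replace each mapped character sequence before tokenizing."""
    for source, target in MAPPINGS.items():
        text = text.replace(source, target)
    return text


def pattern_tokenize(text):
    """Split on the separator pattern and drop empty tokens."""
    return [token for token in re.split(SPLIT_PATTERN, text) if token]


def analyze(text):
    """mapping char filter -> pattern tokenizer -> lowercase -> stop filter."""
    tokens = pattern_tokenize(mapping_char_filter(text))
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("I'm a :) and :( person!"))
```

Note that a separator pattern without the space would leave the whole sentence as a single token, and the stop filter would then have nothing to remove.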

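7. Mixed Chinese and pinyin search suggestions

For mixed Chinese and pinyin search:

  • Use ik as the tokenizer and a pinyin token filter (named pinyin_filter below)
  • Use ik_smart as the tokenizer; ik_max_word is not needed

7.1. Creating a mixed Chinese-pinyin analyzer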

If test_index is left over from an earlier run, delete it first:

DELETE test_index

Then create the index with the custom analyzer, add the mapping, and test the analyzer:

PUT test_index
{
  "settings": {
    "number_of_shards": "1",
    "index": {
      "analysis": {
        "analyzer": {
          "ik_smart_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["pinyin_filter"]
          }
        },
        "filter": {
          "pinyin_filter": {
            "type": "pinyin"
          }
        }
      }
    }
  }
}

PUT test_index/_mapping/test_type
{
  "properties": {
    "lawbasis": {
      "type": "text",
      "analyzer": "ik_max_word",
      "fields": {
        "my_pinyin": {
          "type": "text",
          "analyzer": "ik_smart_pinyin_analyzer"
        }
      }
    }
  }
}

POST test_index/_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_smart_pinyin_analyzer"
}

POST test_index/_analyze
{
  "text": "道路挖掘",
  "analyzer": "ik_smart_pinyin_analyzer"
}

7.2. Search test

POST test_index/test_type
{
  "lawbasis": "道路挖掘"
}

POST test_index/test_type
{
  "lawbasis": "道路施工"
}

GET test_index/test_type/_search
{
  "query": {
    "match": {
      "lawbasis.my_pinyin": "sg"
    }
  }
}
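Why a query for sg can find 道路施工 is easier to see with a toy inverted index in Python. Everything here is an assumption made for illustration: the word segmentation is hard-coded as a guess at what ik_smart would produce, the pinyin filter is modeled as emitting each syllable plus a joined first-letter token per word, and the query is treated as a single term rather than being analyzed. The actual tokens depend on the plugin version and its options, so check them with the _analyze requests from section 7.1.

```python
# Hypothetical syllable table and segmentation for the two test documents.
PINYIN = {
    "道": "dao", "路": "lu", "挖": "wa", "掘": "jue", "施": "shi", "工": "gong",
}
SEGMENTS = {
    "道路挖掘": ["道路", "挖掘"],
    "道路施工": ["道路", "施工"],
}


def pinyin_terms(word):
    """Assumed filter output: every syllable plus the joined first letters."""
    syllables = [PINYIN[ch] for ch in word]
    first_letters = "".join(s[0] for s in syllables)
    return syllables + [first_letters]


def build_index(documents):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc in documents:
        for word in SEGMENTS[doc]:
            for term in pinyin_terms(word):
                index.setdefault(term, set()).add(doc)
    return index


def search(index, term):
    """Return the matching documents in a stable order."""
    return sorted(index.get(term, set()))


if __name__ == "__main__":
    index = build_index(["道路挖掘", "道路施工"])
    print(search(index, "sg"))  # first letters of 施工
    print(search(index, "dl"))  # first letters of 道路, present in both documents
```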

7.3. Understanding fields (multi-fields)

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

Translated blog post: https://blog.csdn.net/qq_32165041/article/details/83688593
