Elasticsearch Multi-Fields and Custom Analyzers

Multi-fields

Multi-fields let you index the same field in several different ways so that it can serve different needs.

JSON

PUT products
{
    "mappings":{
        "properties":{
            "company":{
                "type":"text",
                "fields":{
                    "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                    }
                }
            },
            "comment":{
                "type":"text",
                "fields":{
                    "english_comment":{
                        "type":"text",
                        "analyzer":"english",
                        "search_analyzer":"english"
                    }
                }
            }
        }
    }
}
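
As a rough sketch of how the sub-fields defined above might be queried: each sub-field is addressed as `<field>.<sub-field>`. The search values ("Elastic NV", "running") are made-up examples, not data from the original notes.

```json
# Exact match against the un-analyzed keyword sub-field
GET products/_search
{
  "query": {
    "term": {
      "company.keyword": "Elastic NV"
    }
  }
}

# Full-text match against the sub-field analyzed with the english analyzer
GET products/_search
{
  "query": {
    "match": {
      "comment.english_comment": "running"
    }
  }
}
```

The parent fields `company` and `comment` remain searchable with the default analyzer; the sub-fields only add alternative ways to index the same source value.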

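Exact Values vs. Full Text

Exact Value: a number, a date, or a specific string matched as a whole; in Elasticsearch, the keyword type.
Full Text: unstructured text data; in Elasticsearch, the text type.

An Exact Value does not need to be analyzed (split into terms). Elasticsearch builds an inverted index for each field.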
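
The difference can be observed with the `_analyze` API. This is only an approximation: a keyword field is not analyzed at all, and the built-in `keyword` analyzer is used here as a stand-in because it emits the whole input as a single term. The sample text is an assumption.

```json
# keyword analyzer: the whole input stays one term (exact value)
POST _analyze
{
  "analyzer": "keyword",
  "text": "Elasticsearch Server"
}

# standard analyzer: the input is split into lowercased terms (full text)
POST _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch Server"
}
```

The first request should return the single token "Elasticsearch Server", while the second should return the two tokens "elasticsearch" and "server".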

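Custom Analyzers

A custom analyzer is built by combining three kinds of components: Character Filters, a Tokenizer, and Token Filters.

Character Filter

Character Filters process the text before it reaches the Tokenizer. Several can be configured, and they affect the position and offset information the Tokenizer reports.

Built-in Character Filters: HTML strip (removes HTML tags), Mapping (string replacement), and Pattern replace (regex replacement).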

JSON

POST _analyze
{
  "tokenizer":"keyword",
  "char_filter":["html_strip"],
  "text": "<b>hello world</b>"
}

# Response:
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}


# Use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test"
}

# Response:
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

// Regular-expression replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "pattern_replace",
        "pattern" : "http://(.*)",
        "replacement" : "$1"
      }
    ],
    "text" : "http://www.elastic.co"
}

# Response:
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
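
Because several Character Filters can be configured, they can also be chained in one request. The sketch below combines two of the built-in filters shown above; filters run in the order listed. The sample text is an assumption.

```json
# html_strip runs first, then the mapping filter replaces "-" with "_"
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": [ "- => _" ]
    }
  ],
  "text": "<p>123-456</p>"
}
```

The Tokenizer only sees the text after both filters have run, which is why the reported offsets still refer back to the original input rather than to the filtered string.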

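Tokenizer

A Tokenizer splits the original text into words (terms, or tokens) according to a set of rules.

Tokenizers built into Elasticsearch:
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy

You can also develop your own using the officially provided libraries.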

JSON

POST _analyze
{
  "tokenizer":"path_hierarchy",
  "text":"/user/e/a"
}

# Response:
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/e",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/e/a",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    }
  ]
}
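
As a further illustration of how the choice of Tokenizer changes the result, here is a minimal sketch using the built-in `uax_url_email` tokenizer. The e-mail address and the sentence are made-up sample data.

```json
# uax_url_email behaves like standard, but keeps URLs and e-mail addresses whole
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact john.smith@example.com for details"
}
```

Running the same text through `"tokenizer": "standard"` is a quick way to compare: the standard tokenizer would be expected to break the address apart at the `@`, whereas `uax_url_email` should keep it as one token.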

Token Filter

Token Filters add, modify, or remove the terms output by the Tokenizer.

Built into Elasticsearch:
lowercase / stop / synonym (adds synonyms)

JSON

// white space and snowball
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}

# Response:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gilr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}


// With lowercase added first, "The" becomes "the" and is removed as a stop word
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}

# Response:
{
  "tokens" : [
    {
      "token" : "gilr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "china",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
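
The components covered so far can be registered together as a named custom analyzer in the index settings. The following is a minimal sketch rather than a recommended configuration: the index name `my_custom_index`, the analyzer name `my_custom_analyzer`, and the char filter name `dash_to_underscore` are hypothetical names chosen for illustration.

```json
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dash_to_underscore": {
          "type": "mapping",
          "mappings": [ "- => _" ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "dash_to_underscore" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stop", "snowball" ]
        }
      }
    }
  }
}

# Try the analyzer against sample text
POST my_custom_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<b>The girls in China are playing this game!</b>"
}
```

The order mirrors the pipeline: Character Filters first, then exactly one Tokenizer, then Token Filters in the order listed. Placing `lowercase` before `stop` matters for the reason shown in the previous example.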

References

https://time.geekbang.org/course/intro/100030501?tab=catalog