
elasticsearch custom tokenizer - split token by length


For your requirement, if you can't do it using the pattern tokenizer, you'll need to write a custom Lucene Tokenizer class yourself and ship it as a custom Elasticsearch plugin. You can refer to this for examples of how Elasticsearch plugins are created for custom analyzers.
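
A minimal sketch of what such a tokenizer could look like, assuming Lucene's Tokenizer API; the class name FixedLengthTokenizer is made up for illustration, and a real plugin would additionally need a TokenizerFactory and the usual plugin registration code:

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Emits fixed-length chunks of the input as tokens; a shorter
// final chunk is emitted as-is.
public final class FixedLengthTokenizer extends Tokenizer {

    private final int length;   // token length, e.g. 4
    private final char[] buffer;
    private int offset = 0;     // character offset into the input

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public FixedLengthTokenizer(int length) {
        this.length = length;
        this.buffer = new char[length];
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // 'input' is the Reader supplied by the Tokenizer base class.
        // Note: a Reader may return fewer chars than requested; a
        // production version should loop until the buffer is full.
        int read = input.read(buffer, 0, length);
        if (read <= 0) {
            return false;   // end of input, no more tokens
        }
        termAtt.copyBuffer(buffer, 0, read);
        offsetAtt.setOffset(correctOffset(offset), correctOffset(offset + read));
        offset += read;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        offset = 0;
    }

    @Override
    public void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(offset);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}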


The Pattern Tokenizer supports a parameter "group"

It has a default of "-1", which means to use the pattern for splitting, which is what you saw.

However by defining a group >= 0 in your pattern and setting the group-parameter this can be done! E.g. the following tokenizer will split the input into 4-character tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}

Analyzing a document via the following:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

Results in the following tokens:

{  "tokens": [    {      "token": "comm",      "start_offset": 0,      "end_offset": 4,      "type": "word",      "position": 0    },    {      "token": "a,se",      "start_offset": 4,      "end_offset": 8,      "type": "word",      "position": 1    },    {      "token": "para",      "start_offset": 8,      "end_offset": 12,      "type": "word",      "position": 2    },    {      "token": "ted,",      "start_offset": 12,      "end_offset": 16,      "type": "word",      "position": 3    },    {      "token": "valu",      "start_offset": 16,      "end_offset": 20,      "type": "word",      "position": 4    }  ]}