elasticsearch custom tokenizer - split token by length
For your requirement, if you can't do it with the pattern tokenizer, you'll need to write a custom Lucene Tokenizer class yourself and ship it as a custom Elasticsearch plugin. You can refer to this for examples of how Elasticsearch plugins are created for custom analyzers.
The Pattern Tokenizer supports a "group" parameter.
It defaults to -1, which means the pattern is used for splitting, which is the behavior you saw.
However, by defining a capture group in your pattern and setting the "group" parameter to its index (>= 0), this can be done. E.g. the following tokenizer will split the input into 4-character tokens:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}
Analyzing a document via the following:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
Results in the following tokens:
{
  "tokens": [
    {
      "token": "comm",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "a,se",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "para",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "ted,",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "valu",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    }
  ]
}

Note that the trailing "es" is dropped: the pattern (.{4}) only matches runs of exactly four characters, so a remainder shorter than the group length does not become a token.
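If it helps to see the grouping logic in isolation, here is a minimal Python sketch of what the tokenizer is doing with group=1: it emits every match of the first capture group rather than splitting on the pattern. This is only an illustration of the regex semantics, not the actual Lucene implementation.

```python
import re

def split_by_length(text, n=4):
    # Emit each match of the capture group (.{n}), mirroring the
    # pattern tokenizer's "group": "1" behavior. A trailing remainder
    # shorter than n characters produces no match and is dropped.
    return re.findall(r"(.{%d})" % n, text)

print(split_by_length("comma,separated,values"))
# ['comm', 'a,se', 'para', 'ted,', 'valu']
```

The output matches the tokens returned by the _analyze call above, including the dropped "es" at the end.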