
Regexp starts with not working Elasticsearch 6.*


It looks like you are using the text datatype to store Unit.DailyAvailability (which is also the default for strings when dynamic mapping is used). You should consider using the keyword datatype instead.

Let me explain in a bit more detail.

Why does my regex match something in the middle of a text field?

With the text datatype, the data gets analyzed for full-text search: the analyzer applies transformations such as lowercasing and splitting the input into tokens.

Let's try to use the Analyze API against your input:

POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

The response is:

{  "tokens": [    {      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",      "start_offset": 0,      "end_offset": 255,      "type": "<ALPHANUM>",      "position": 0    },    {      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",      "start_offset": 255,      "end_offset": 510,      "type": "<ALPHANUM>",      "position": 1    },    {      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",      "start_offset": 510,      "end_offset": 732,      "type": "<ALPHANUM>",      "position": 2    }  ]}

As you can see, Elasticsearch has split your input into three tokens and lowercased them. This may look unexpected, but it makes sense once you remember that the analyzer is built to facilitate searching for words in human language, and no human-language words are that long.

That's why the regexp query ".{7}a{7}.*" matches: there is a token that actually starts with a long run of a's, so the match is expected behavior for the regexp query. As the documentation puts it:

...Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.
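
To see this end to end, here is a minimal sketch of a reproduction (regexp_demo and availability are made-up names, purely for illustration): a short value in a text field is split into tokens, and a pattern that only lines up with one of those tokens still matches the document.

PUT regexp_demo
{
  "mappings": {
    "doc": {
      "properties": {
        "availability": { "type": "text" }
      }
    }
  }
}

PUT regexp_demo/doc/1
{
  "availability": "UIAO UUUUUUUU AAAAAAAA"
}

POST regexp_demo/doc/_search
{
  "query": {
    "regexp": { "availability": "a{8}" }
  }
}

The standard analyzer turns the value into the tokens uiao, uuuuuuuu and aaaaaaaa, and the pattern a{8} matches the last token in full, so the document is returned even though the pattern clearly does not describe the whole original string.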

How can I make regexp query consider the entire string?

It is very simple: do not analyze the field. The keyword type stores the string you provide as-is.

With a mapping like this:

PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

You will be able to do a query like this that will match the document from the post:

POST my_regexes/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
        }
      ]
    }
  }
}

Note that the query became case-sensitive because the field is not analyzed.

This regexp won't return any results anymore: ".{12}a{7}.*"

This will: ".{12}A{7}.*"
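
For completeness, the matching variant spelled out as a full request against the my_regexes index from above would look like this (just a sketch reusing the same field):

POST my_regexes/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailablity": ".{12}A{7}.*" }
        }
      ]
    }
  }
}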

So what about anchoring?

The regexes are anchored:

Lucene’s patterns are always anchored. The pattern provided must match the entire string.

The reason it looked like the anchoring was wrong was most likely that the analyzed text field split the value into tokens, and the pattern was anchored against each token rather than against the whole string.
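
As a quick illustration of that anchoring against the keyword mapping above (again just a sketch): a pattern that covers only a prefix of the stored value does not match until a trailing .* is added to consume the rest of the string.

POST my_regexes/doc/_search
{
  "query": {
    "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA" }
  }
}

POST my_regexes/doc/_search
{
  "query": {
    "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
  }
}

The first request returns nothing because the pattern must match the entire keyword value; the second one matches the document.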


Just an addition to the brilliant and helpful answer by Nikolay Vasiliev. In my case I had to go further to make it work with NEST (.NET). I added an attribute mapping to DailyAvailability:

[Keyword(Name = "DailyAvailability")]public string DailyAvailability { get; set; }

The filter still didn't work, and the generated mapping looked like this:

 "DailyAvailability":"type":"text",     "fields":{           "keyword":{               "type":"keyword",             "ignore_above":256         }      } }

My field contained about 732 characters, so it was not indexed by the keyword sub-field (its ignore_above is 256). I tried:

[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]public string DailyAvailability { get; set; }

It didn't make any difference to the mapping. Only after adding a manual mapping did it start working properly:

var client = new ElasticClient(settings);

client.CreateIndex("vrp", c => c
    .Mappings(ms => ms.Map<Unit>(m => m
        .Properties(ps => ps
            .Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
        )
    )
));

The point is that:

ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.

So use an explicit mapping for long keyword fields to set ignore_above if you need to filter them with regexp queries.
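
For reference, the REST equivalent of that explicit NEST mapping would look roughly like this (a sketch: the vrp index name comes from the snippet above, while unit as the mapping type name is an assumption about what NEST derives from the Unit class):

PUT vrp
{
  "mappings": {
    "unit": {
      "properties": {
        "DailyAvailability": {
          "type": "keyword",
          "ignore_above": 1024
        }
      }
    }
  }
}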


In case it is useful to anyone: Elasticsearch's regexp syntax does not support the \d and \w shorthand character classes; you should write them out explicitly, e.g. [0-9] for \d and [a-zA-Z0-9_] for \w.
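
For example, a hypothetical pattern like \d{4}\w+ would have to be rewritten with explicit character classes before being used in a regexp query (the index and field names here are placeholders):

POST my_index/_search
{
  "query": {
    "regexp": { "some_field": "[0-9]{4}[a-zA-Z0-9_]+" }
  }
}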