How to wisely combine shingles and edgeNgram to provide flexible full text search?



This is an interesting use case. Here's my take:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
```
  • my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case. For testing purposes I chose a maximum of 25 chars. lowercase is used since you said the matching should be case-insensitive. Basically, this is the analyzer used for substringof('table 1',name). The query is simple:
```json
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}
```
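To see why the unanalyzed term query matches, here is a minimal Python sketch (not Elasticsearch code) of what the nGram tokenizer plus lowercase filter emit at index time: every lowercased substring between min_gram and max_gram characters long becomes a term, so "table 1" is itself one of the indexed terms of any name containing it.

```python
def ngrams(text, min_gram=2, max_gram=25):
    # Model of the nGram tokenizer + lowercase filter:
    # emit every substring whose length is in [min_gram, max_gram].
    text = text.lower()
    return {text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)}

# "table 1" is one of the grams of a longer name, so a term query
# (which does not analyze its input) for "table 1" matches it.
print("table 1" in ngrams("Table 123 overview"))  # True
```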
  • my_edge_ngram_analyzer is used to split the text starting from the beginning, and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
```json
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}
```
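The edgeNGram case is the same idea but anchored at the start of the text; a small Python sketch (again just a model, not Elasticsearch code) of what gets indexed:

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    # Model of the edgeNGram tokenizer + lowercase filter:
    # emit only prefixes, from min_gram up to max_gram characters.
    token = text.lower()
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

# Every indexed term is a prefix, so a term query on text.starts_with
# for "table 1" only matches names that begin with it.
print("table 1" in edge_ngrams("Table 123"))  # True
```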
  • I found this the most tricky part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with a lowercase filter and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but with the edge at the end of the text, not the start (as with the regular edgeNGram). The query:
```json
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
```
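The reverse/edgeNGram/reverse trick is easiest to see by stepping through it; a Python sketch of the filter chain (a model only, under the same min_gram/max_gram assumptions as the mapping above):

```python
def reverse_edge_ngrams(text, min_gram=2, max_gram=25):
    # keyword tokenizer + lowercase, then the filter chain:
    token = text.lower()[::-1]                 # 1st reverse filter
    grams = [token[:n]                         # edgeNGram filter: prefixes
             for n in range(min_gram, min(max_gram, len(token)) + 1)]
    return {g[::-1] for g in grams}            # 2nd reverse filter

# Reversing, taking prefixes, and reversing back yields SUFFIXES of the
# original text, so a term query for "table 1" matches names ending in it.
print("table 1" in reverse_edge_ngrams("My Table 1"))  # True
```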
  • for the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
```json
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}
```

Regarding query_string, this changes the solution a bit, because I was counting on term not to analyze the input text and to match it exactly against one of the terms in the index.

But this can be "simulated" with query_string if the appropriate analyzer is specified for it.

The solution would be a set of queries like the following (always use that analyzer, changing only the field name):

```json
{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}
```
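Since only the field name varies between the four cases, the bodies can be generated; a small Python sketch (the helper name query_for is mine, not part of any API) that builds the equivalent query_string body for each field, always forcing the lowercase_keyword analyzer so the input stays a single lowercased token, just as term did:

```python
def query_for(field, value):
    # Build a query_string body; lowercase_keyword keeps the search
    # input as one lowercased token instead of analyzing it apart.
    return {
        "query": {
            "query_string": {
                "query": '{}:("{}")'.format(field, value),
                "analyzer": "lowercase_keyword",
            }
        }
    }

for field in ("text", "text.starts_with", "text.ends_with",
              "text.exact_case_insensitive_match"):
    print(query_for(field, "table 1"))
```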