How to wisely combine shingles and edgeNgram to provide flexible full text search?



This is an interesting use case. Here's my take:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
```
  • my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case. For testing purposes I chose a maximum of 25 chars. lowercase is used since you said the matching should be case-insensitive. Basically, this is the analyzer used for substringof('table 1',name). The query is simple:
```json
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}
```
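To see why the unanalyzed term query matches, here is a minimal Python sketch (not Elasticsearch code) of what the nGram tokenizer plus lowercase filter emit at index time: every lowercased substring between min_gram and max_gram characters long becomes a term, so "table 1" is itself one of the indexed terms of any name containing it.

```python
def ngrams(text, min_gram=2, max_gram=25):
    # Model of the nGram tokenizer + lowercase filter:
    # emit every substring whose length is in [min_gram, max_gram].
    text = text.lower()
    return {text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)}

# "table 1" is one of the grams of a longer name, so a term query
# (which does not analyze its input) for "table 1" matches it.
print("table 1" in ngrams("Table 123 overview"))  # True
```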
  • my_edge_ngram_analyzer is used to split the text starting from the beginning, and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
```json
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}
```
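The edgeNGram case is the same idea but anchored at the start of the text; a small Python sketch (again just a model, not Elasticsearch code) of what gets indexed:

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    # Model of the edgeNGram tokenizer + lowercase filter:
    # emit only prefixes, from min_gram up to max_gram characters.
    token = text.lower()
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

# Every indexed term is a prefix, so a term query on text.starts_with
# for "table 1" only matches names that begin with it.
print("table 1" in edge_ngrams("Table 123"))  # True
```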
  • I found this the most tricky part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with a lowercase filter and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but with the edge at the end of the text, not the start (as with the regular edgeNGram). The query:
```json
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
```
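The reverse/edgeNGram/reverse trick is easiest to see by stepping through it; a Python sketch of the filter chain (a model only, under the same min_gram/max_gram assumptions as the mapping above):

```python
def reverse_edge_ngrams(text, min_gram=2, max_gram=25):
    # keyword tokenizer + lowercase, then the filter chain:
    token = text.lower()[::-1]                 # 1st reverse filter
    grams = [token[:n]                         # edgeNGram filter: prefixes
             for n in range(min_gram, min(max_gram, len(token)) + 1)]
    return {g[::-1] for g in grams}            # 2nd reverse filter

# Reversing, taking prefixes, and reversing back yields SUFFIXES of the
# original text, so a term query for "table 1" matches names ending in it.
print("table 1" in reverse_edge_ngrams("My Table 1"))  # True
```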
  • for the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
```json
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}
```

Regarding query_string, this changes the solution a bit, because I was counting on term not to analyze the input text and to match it exactly against one of the terms in the index.

But this can be "simulated" with query_string if the appropriate analyzer is specified for it.

The solution would be a set of queries like the following (always use that analyzer, changing only the field name):

```json
{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}
```
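Since only the field name varies between the four cases, the bodies can be generated; a small Python sketch (the helper name query_for is mine, not part of any API) that builds the equivalent query_string body for each field, always forcing the lowercase_keyword analyzer so the input stays a single lowercased token, just as term did:

```python
def query_for(field, value):
    # Build a query_string body; lowercase_keyword keeps the search
    # input as one lowercased token instead of analyzing it apart.
    return {
        "query": {
            "query_string": {
                "query": '{}:("{}")'.format(field, value),
                "analyzer": "lowercase_keyword",
            }
        }
    }

for field in ("text", "text.starts_with", "text.ends_with",
              "text.exact_case_insensitive_match"):
    print(query_for(field, "table 1"))
```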