Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

You need to understand how Elasticsearch/Lucene analyzes your data and how it calculates the search score.

1. Analyze API

https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html shows you the tokens that Elasticsearch will store. In your case:

T / TR / TRE / ... / TRENDING / H / HI
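As a rough illustration of that token list, here is a small Python emulation of an edge n-gram tokenizer that splits on whitespace. It does not call Elasticsearch; the function name edge_ngrams and the default gram sizes are assumptions made for this sketch, so treat the Analyze API output as the real reference.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Emulate edge n-grams for every whitespace-separated word (illustrative only)."""
    tokens = []
    for word in text.split():
        # An edge n-gram is a prefix of the word, from min_gram up to max_gram characters.
        longest = min(len(word), max_gram)
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size])
    return tokens


tokens = edge_ngrams("TRENDING HI")
print(" / ".join(tokens))
# prints: T / TR / TRE / TREN / TREND / TRENDI / TRENDIN / TRENDING / H / HI
```

Because HI is emitted as a token for the second word as well, a plain match on HI cannot tell a leading HI from a trailing one.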

2. Score

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

The bool query is often used to build complex queries tailored to a particular use case. Use must to decide which documents match, then should to raise the score of the ones you prefer. A common pattern is to apply different analyzers to the same field (by using the fields parameter in the mapping, you can analyze the same field in several ways).

3. Don't mess up the highlighting

According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query

You can add an extra query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": ["<"],
    "post_tags": [">"],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}

In this particular case you could add a match_phrase_prefix clause to your query, which performs a prefix match on the last term of the query text:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}

The match clause will match all three documents, but the match_phrase_prefix clause won't match TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.

Quoting the docs:

The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.

On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
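To show where that option sits in the request body, here is a minimal Python sketch that builds the bool query from this answer as a dictionary and prints it as JSON. The helper name build_query and its parameters are hypothetical; only the query structure and the minimum_should_match key come from the Elasticsearch query DSL.

```python
import json


def build_query(text, field="name", minimum_should_match=1):
    """Build a bool query with two should clauses (hypothetical helper, for illustration)."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: text}},
                    {"match_phrase_prefix": {field: text}},
                ],
                # Number of should clauses a document has to satisfy to be returned.
                "minimum_should_match": minimum_should_match,
            }
        }
    }


body = build_query("HI")
print(json.dumps(body, indent=2))
```

With a value of 1, a document that satisfies either clause is returned and the second clause only adds to its score. With a value of 2, both clauses would have to match, which would exclude the documents you only wanted to rank lower.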

A possible solution for this problem is to use multi-fields. They let you index the same data from your source document in different ways. In your case you could index the name field as default text, then as n-grams, and also as edge n-grams. The query would then be a bool query that matches against each of those sub-fields.

The final score of a document is composed of the scores of each clause that matches it. Those matches are also called signals, because each one signals a match between the query and the document. In general, the document with the most matching signals gets the highest score.

In your case all documents would match the n-gram HI, but only the HITS FIND SOME and HITS OTHER documents would get the additional edge n-gram score. This would give those two documents a boost and put them on top. The complication is that you have to make sure the edge n-gram tokenizer doesn't split on whitespace, because otherwise an HI at the end of the document would get the same score as an HI at the beginning.
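The idea of counting signals can be sketched in plain Python. This is only a toy model under stated assumptions: it counts matching sub-fields instead of computing Lucene's real relevance score, the helper names (ngrams, edge_ngrams, count_signals) are made up for the illustration, and the edge n-grams are taken from the whole string, that is, without splitting on whitespace.

```python
def ngrams(text, min_gram=2, max_gram=10):
    """All substrings of each word between min_gram and max_gram characters long."""
    grams = set()
    for word in text.split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.add(word[start:start + size])
    return grams


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Prefixes of the whole string; whitespace is deliberately NOT a separator."""
    longest = min(len(text), max_gram)
    return {text[:size] for size in range(min_gram, longest + 1)}


def count_signals(doc, query):
    """One signal per sub-field whose tokens contain the query (toy stand-in for scoring)."""
    signals = 0
    if query in ngrams(doc):
        signals += 1
    if query in edge_ngrams(doc):
        signals += 1
    return signals


docs = ["TRENDING HI", "HITS FIND SOME", "HITS OTHER"]
ranked = sorted(docs, key=lambda doc: count_signals(doc, "HI"), reverse=True)
for doc in ranked:
    print(count_signals(doc, "HI"), doc)
```

Both HITS documents collect two signals, while TRENDING HI only collects the n-gram signal and therefore sorts last. If edge_ngrams split on whitespace, TRENDING HI would also collect two signals, which is the complication described above.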

Here is an example mapping and query for your case:

PUT /tag/
{
    "settings": {
        "analysis": {
            "analyzer": {
                "edge_analyzer": {
                    "tokenizer": "edge_tokenizer"
                },
                "kw_analyzer": {
                    "tokenizer": "kw_tokenizer"
                },
                "ngram_analyzer": {
                    "tokenizer": "ngram_tokenizer"
                },
                "autocomplete_analyzer": {
                    "tokenizer": "autocomplete_tokenizer",
                    "filter": [
                        "standard"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "whitespace"
                }
            },
            "tokenizer": {
                "kw_tokenizer": {
                    "type": "keyword"
                },
                "edge_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10
                },
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                },
                "autocomplete_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "symbol"
                    ]
                }
            }
        }
    },
    "mappings": {
        "tag": {
            "properties": {
                "id": {
                    "type": "long"
                },
                "name": {
                    "type": "text",
                    "fields": {
                        "edge": {
                            "type": "text",
                            "analyzer": "edge_analyzer"
                        },
                        "ngram": {
                            "type": "text",
                            "analyzer": "ngram_analyzer"
                        }
                    }
                }
            }
        }
    }
}

And a query:

POST /tag/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "function_score": {
                        "query": {
                            "match": {
                                "name.edge": {
                                    "query": "HI"
                                }
                            }
                        },
                        "boost": "5",
                        "boost_mode": "multiply"
                    }
                },
                {
                    "match": {
                        "name.ngram": {
                            "query": "HI"
                        }
                    }
                },
                {
                    "match": {
                        "name": {
                            "query": "HI"
                        }
                    }
                }
            ]
        }
    }
}
