How to make elasticsearch scoring take field-length into account


As I described in this answer, scoring/relevance is not the easiest topic in Elasticsearch.

I tried to work out a solution for you, and this is what I have so far.

Documents:

    { "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"] } ], "topics_count": 5 }
    { "tags": [ { "topics": ["music", "festival", "dance", "techno"] } ], "topics_count": 4 }
    { "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
    { "tags": [ { "topics": ["music", "festival"] } ], "topics_count": 2 }
    { "tags": [ { "topics": ["music"] } ], "topics_count": 1 }

and query:

    {
      "query": {
        "bool": {
          "should": [
            {
              "function_score": {
                "query": {
                  "terms_set": {
                    "tags.topics": {
                      "terms": ["music", "festival"],
                      "minimum_should_match_script": {
                        "source": "params.num_terms"
                      }
                    }
                  }
                },
                "script_score": {
                  "script": {
                    "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
                  }
                }
              }
            },
            {
              "function_score": {
                "query": {
                  "terms_set": {
                    "tags.topics": {
                      "terms": ["music", "festival"],
                      "minimum_should_match_script": {
                        "source": "doc['topics_count'].value"
                      }
                    }
                  }
                },
                "script_score": {
                  "script": {
                    "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
                  }
                }
              }
            }
          ]
        }
      }
    }
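To make it easier to see what the two should clauses do, here is a small Python sketch that mimics them outside Elasticsearch. It is only an illustration under assumptions: it treats terms_set as "the number of matching terms must be at least minimum_should_match", it reproduces only the Math.sqrt(1.0 / topics_count) multiplier from the script, and it does not compute the real _score. The function name explain is made up for this example.

```python
import math

# The five example documents, reduced to their topic lists.
DOCS = [
    ["music", "festival", "dance", "techno", "germany"],
    ["music", "festival", "dance", "techno"],
    ["music", "festival", "dance"],
    ["music", "festival"],
    ["music"],
]


def explain(doc_topics, query_terms):
    """Return (clause1_matches, clause2_matches, length_factor) for one document.

    Hypothetical helper: it approximates the matching logic of the two
    terms_set clauses, not Elasticsearch's actual scoring.
    """
    topics_count = len(doc_topics)
    matched = len(set(doc_topics) & set(query_terms))
    # Clause 1: minimum_should_match_script = params.num_terms,
    # so every query term has to be present in the document.
    clause1 = matched >= len(query_terms)
    # Clause 2: minimum_should_match_script = doc['topics_count'].value,
    # so every topic of the document has to be one of the query terms.
    clause2 = matched >= topics_count
    # The multiplier applied to _score by the script_score function.
    factor = math.sqrt(1.0 / topics_count)
    return clause1, clause2, factor


for topics in DOCS:
    c1, c2, factor = explain(topics, ["music", "festival"])
    print(len(topics), c1, c2, round(factor, 3))
```

For ["music", "festival"], only the two-topic document satisfies both clauses in this model, so the bool query adds both clause scores for it, while the multiplier shrinks from 1.0 (one topic) to about 0.447 (five topics) for longer topic lists.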

It is not perfect and still needs some improvement. It works well (tested on ES 6.2) for ["music", "festival"] and ["music", "dance"] on this example, but I suspect that on other data it will not behave 100% as you expect, mostly because of how complex relevance scoring is. You can now read more about the features I used (terms_set, function_score and script_score) and try to improve it.