How to make Elasticsearch scoring take field length into account
As I described in this answer, scoring/relevance is not the easiest topic in Elasticsearch.
I tried to work out a solution for you, and this is what I have so far.
Documents:
{ "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"] } ], "topics_count": 5 }
{ "tags": [ { "topics": ["music", "festival", "dance", "techno"] } ], "topics_count": 4 }
{ "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
{ "tags": [ { "topics": ["music", "festival"] } ], "topics_count": 2 }
{ "tags": [ { "topics": ["music"] } ], "topics_count": 1 }
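The `topics_count` field has to agree with the actual number of topics, otherwise the scripts in the query will use a wrong value. A minimal Python sketch of how it could be derived before indexing is shown here; `with_topics_count` is a hypothetical helper name, and it assumes the count should cover the topics of all entries in `tags`.

```python
def with_topics_count(doc):
    """Return a copy of doc with topics_count set from its tags (hypothetical helper)."""
    # Assumption: the count is the total number of topics across all tag entries.
    count = sum(len(tag.get("topics", [])) for tag in doc.get("tags", []))
    return {**doc, "topics_count": count}


doc = {"tags": [{"topics": ["music", "festival", "dance"]}]}
print(with_topics_count(doc)["topics_count"])  # prints 3
```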
And the query:
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "params.num_terms"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        },
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "doc['topics_count'].value"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        }
      ]
    }
  }
}
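The query repeats the same terms list and the same script in both clauses, so it can be convenient to generate it. The Python sketch below is one possible way to do that; `build_query` and its parameters are hypothetical names, not part of any Elasticsearch client, and it only builds the request body as a plain dict. The first clause (`params.num_terms`) requires a document to contain every searched term; the second (`doc['topics_count'].value`) requires at least as many matching terms as the document has topics.

```python
import json


def build_query(terms, field="tags.topics", count_field="topics_count"):
    """Build the two-clause request body for the given terms (hypothetical helper)."""
    # Same length-normalising script as in the query above.
    script_score = {
        "script": {
            "source": f"_score * Math.sqrt(1.0 / doc['{count_field}'].value)"
        }
    }

    def clause(min_match_source):
        # One function_score wrapping a terms_set query on the topics field.
        return {
            "function_score": {
                "query": {
                    "terms_set": {
                        field: {
                            "terms": list(terms),
                            "minimum_should_match_script": {
                                "source": min_match_source
                            },
                        }
                    }
                },
                "script_score": script_score,
            }
        }

    return {
        "query": {
            "bool": {
                "should": [
                    clause("params.num_terms"),
                    clause(f"doc['{count_field}'].value"),
                ]
            }
        }
    }


print(json.dumps(build_query(["music", "dance"]), indent=2))
```

The resulting dict can be sent as the body of a search request with whatever client you use.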
It is not perfect and still needs some improvement. On this example it works well (tested on ES 6.2) for ["music", "festival"] and ["music", "dance"], but I suspect that on other data the results will not be 100% what you expect, mostly because of how complex relevance scoring is. You can now read more about the features I used (terms_set, function_score, script_score) and try to improve it.
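To see how the script favours documents with fewer topics, the factor `Math.sqrt(1.0 / doc['topics_count'].value)` can be evaluated for the five example documents. This is only a sketch of the arithmetic in Python (the name `length_factor` is made up for illustration); the `_score` it multiplies comes from Elasticsearch and is not reproduced here.

```python
import math


def length_factor(topics_count):
    """Mirror of the script's multiplier: sqrt(1 / topics_count)."""
    return math.sqrt(1.0 / topics_count)


for n in range(1, 6):
    print(n, round(length_factor(n), 3))
# 1 1.0
# 2 0.707
# 3 0.577
# 4 0.5
# 5 0.447
```

Keep in mind that function_score combines the script result with the query score according to `boost_mode`, which defaults to `multiply`; adding `"boost_mode": "replace"` would use the script value alone.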