Elasticsearch - How to guess important words in queries?


You can use the regular "match" query and add a "cutoff_frequency" parameter, like this:

    {
        "query": {
            "match": {
                "<field_name>": {
                    "query": "PHP Developer",
                    "operator": "AND",
                    "cutoff_frequency": 0.001
                }
            }
        }
    }

That way, each term that appears in less than 0.1% of the documents is considered "important" and becomes a "must", while the other terms are not required and only increase the score. "Developer" will be more common than "PHP", so "PHP" will be a must, while "Developer" will be optional and will only raise the score of documents that contain it. Note that "PHP" might still be pretty common, so you do need to fine-tune the cutoff frequency!
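To make the split concrete, here is a minimal Python sketch of the idea, not of Elasticsearch's actual implementation. The function names, the document counts and the "job_type" field are made up for illustration, and the sketch assumes the field is indexed as lowercase tokens. It ignores edge cases such as a query made up only of common terms.

```python
def split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency=0.001):
    """Partition terms into (low_freq, high_freq) by relative document frequency."""
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        # Terms above the cutoff are treated as "common"; the rest as "important".
        (high if ratio > cutoff_frequency else low).append(term)
    return low, high


def build_bool_query(field, text, doc_freq, total_docs, cutoff_frequency=0.001):
    """Build a bool query: rare terms are required, common terms only add score."""
    terms = text.lower().split()  # assumption: naive whitespace tokenisation
    low, high = split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency)
    return {
        "query": {
            "bool": {
                "must": [{"term": {field: t}} for t in low],
                "should": [{"term": {field: t}} for t in high],
            }
        }
    }


# Hypothetical counts: 100,000 postings, 50 mention "php", 40,000 mention "developer".
FREQS = {"php": 50, "developer": 40000}
EXAMPLE = build_bool_query("job_type", "PHP Developer", FREQS, 100000)
```

With these made-up numbers "php" (0.05%) lands in "must" and "developer" (40%) lands in "should", which is the behaviour described above.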


I don't think there is an easy answer. Depending on how many terms like "developer" you have, you could do something with the Boosting query. You would have to pull those terms out of the search query yourself and build the Boosting query from them.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html

A better approach might be to use the common terms query. It lets you give less importance to terms that appear in a lot of the documents ("high frequency" terms). Setting low_freq_operator to "and" could help you with what you want to accomplish.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
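As a rough sketch of what such a request body could look like, here it is built as a Python dict. The helper name, the "job_type" field and the 0.001 cutoff are assumptions for illustration. Also, as far as I know the common terms query was deprecated in Elasticsearch 7.3 and is not available in 8.x, so check your version first.

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Return a request body for a common terms query on one field."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    # Terms more frequent than this are "high frequency".
                    "cutoff_frequency": cutoff_frequency,
                    # Every low-frequency (important) term must match.
                    "low_freq_operator": "and",
                }
            }
        }
    }


# Print the JSON body that would be sent to the _search endpoint.
print(json.dumps(common_terms_query("job_type", "PHP Developer"), indent=2))
```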


You could use a custom analyser for the field to make the tokens for the field always be consistent. In this case you could use a token filter of type "stop" (a stopwords filter) with "Developer" in the stopwords list (and anything else that should effectively be ignored). This will be applied to both the query and the data when indexed, so if you have "PHP Developer" in the index, and "PHP" in the query, they will both be turned into a token of "PHP" so they will be an exact match.

To make this more robust to different ways of typing "Developer", you would probably want to use a "lowercase" token filter as well, so the stopword would be "developer" instead.
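To see why the filter order matters, here is a toy Python imitation of that token filter chain. It is only a sketch: a plain whitespace split stands in for a real Elasticsearch tokenizer, and the function name and stopword list are assumptions.

```python
STOPWORDS = {"developer", "dev"}


def job_analyze(text, stopwords=STOPWORDS):
    """Imitate: tokenize -> "lowercase" filter -> "stop" filter."""
    tokens = text.split()                 # stand-in for the tokenizer
    tokens = [t.lower() for t in tokens]  # lowercase first...
    # ...so that "Developer", "developer" and "dEv" all hit the stopword list.
    return [t for t in tokens if t not in stopwords]


for sample in ["PHP Developer", "php dEv", "PHP"]:
    print(sample, "->", job_analyze(sample))
```

All three inputs reduce to the same single token, so the indexed value and the query line up regardless of casing or whether "Developer" was typed.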

You should note this will require reindexing the data.

The settings file will end up something like this:

    {
        "analysis": {
            "filter": {
                "job_stopwords": {
                    "type": "stop",
                    "stopwords": [
                        "developer", "dev"
                    ]
                }
            },
            "analyzer": {
                "job_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase", "job_stopwords"
                    ]
                }
            }
        }
    }

Then you need to apply the job_analyzer analyser to the job field in the mapping for your document.

To have "Developer" increase the score of the hit, you could add a sub-field on the mapping for the field, which uses the default analyser. You could then put a match on the job_analyzer field in "must" and a match on the default-analysed sub-field in "should".

Your mappings would look something like this:

    {
        "job_posting": {
            "properties": {
                "job_type": {
                    "type": "string",
                    "analyzer": "job_analyzer",
                    "fields": {
                        "default": {
                            "type": "string"
                        }
                    }
                }
            }
        }
    }

Your query would then be something like this:

    {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        "job_type": "PHP Developer"
                    }
                },
                "should": {
                    "match": {
                        "job_type.default": "PHP Developer"
                    }
                }
            }
        }
    }

This will match "PHP Developer", "php dEv" and "PHP", but "PHP Developer" will get the highest score.
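If it helps to see the must/should interplay, here is a toy Python model of it. It is a sketch under heavy assumptions: whitespace tokenisation, and a "score" that just counts overlapping tokens, which is nothing like Elasticsearch's real relevance scoring. All function names are made up.

```python
STOPWORDS = {"developer", "dev"}


def default_tokens(text):
    """Stand-in for the default analyser: split and lowercase."""
    return [t.lower() for t in text.split()]


def job_tokens(text):
    """Stand-in for job_analyzer: default tokens minus the stopwords."""
    return [t for t in default_tokens(text) if t not in STOPWORDS]


def toy_score(query, doc):
    """Return None if the "must" clause fails, else a crude overlap count."""
    must_hits = set(job_tokens(query)) & set(job_tokens(doc))
    if not must_hits:
        return None  # "must" on job_type did not match: document excluded
    # "should" on job_type.default only adds to the score.
    should_hits = set(default_tokens(query)) & set(default_tokens(doc))
    return len(must_hits) + len(should_hits)


for doc in ["PHP Developer", "php dEv", "PHP", "Java Developer"]:
    print(doc, "->", toy_score("PHP Developer", doc))
```

In this toy model the three PHP documents pass the "must" clause, "Java Developer" is excluded, and "PHP Developer" scores highest because it also matches "developer" in the "should" clause.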