Elasticsearch - How to guess important words in queries?

elasticsearch full-text-search precision booleanquery

You can use the regular "match" query and add a "cutoff_frequency" parameter, like this:

    {
      "query": {
        "match": {
          "<field_name>": {
            "query": "PHP Developer",
            "operator": "AND",
            "cutoff_frequency": 0.001
          }
        }
      }
    }

That way, each term that appears in less than 0.1% of the documents is considered "important" and becomes a "must", while the other terms are not required and only increase the score. "Developer" will be more common than "PHP", so "PHP" will be a must, while "Developer" will be optional and documents that contain it will be rated higher. Note that "PHP" might still be pretty common, so you do need to fine-tune the cutoff frequency!
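To make the split concrete, here is a small Python sketch of how a cutoff of 0.001 divides the query terms. It only imitates the idea: `split_by_cutoff` is a hypothetical helper, not an Elasticsearch API, and the document counts are made-up numbers. Elasticsearch does this internally from its own index statistics.

```python
def split_by_cutoff(terms, doc_freq, total_docs, cutoff=0.001):
    """Split terms into (low_freq, high_freq) by relative document frequency."""
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        # Terms above the cutoff are "common"; the rest are "important".
        (high if ratio > cutoff else low).append(term)
    return low, high


# Hypothetical counts for an index of 1,000,000 job postings.
doc_freq = {"php": 800, "developer": 120_000}
low, high = split_by_cutoff(["php", "developer"], doc_freq, 1_000_000)

print("must match:", low)
print("score only:", high)
```

With these invented counts, "php" lands in the required group and "developer" in the optional one. If "php" appeared in more than 0.1% of the documents, it would become optional too, which is why the cutoff needs tuning against your own data.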


I don't think there is an easy answer. Depending on the number of terms like "developer" you have, you could do something like the Boosting query, which demotes documents that match a negative query instead of excluding them. You'd have to filter those terms out of your search query and create the Boosting query.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html

A better approach might be to use the common terms query. It lets you give less importance to "high frequency" terms, that is, terms that occur in a lot of the documents. Setting low_freq_operator to AND, so that all of the rarer terms must match, could help you with what you want to accomplish.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
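As a rough illustration, the request body for such a query could be built like this in Python. This is a sketch under assumptions: the field name "job_type", the 0.001 cutoff, and the helper `common_terms_query` are placeholders chosen for the example, and it assumes a version of Elasticsearch that still supports the `common` query.

```python
import json


def common_terms_query(field, text, cutoff=0.001):
    """Build a request body for the common terms query (illustrative helper)."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    # Terms above this document frequency count as "high frequency".
                    "cutoff_frequency": cutoff,
                    # Every low-frequency (rarer) term must match.
                    "low_freq_operator": "and",
                }
            }
        }
    }


body = common_terms_query("job_type", "PHP Developer")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. The high-frequency terms are left optional here, so they only influence the score.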


You could use a custom analyser for the field to make the tokens for the field always be consistent. In this case you could use a token filter of type "stop" (a stopwords filter) with "Developer" in the stopwords list (and anything else that should effectively be ignored). This will be applied to both the query and the data when indexed, so if you have "PHP Developer" in the index, and "PHP" in the query, they will both be turned into a token of "PHP" so they will be an exact match.

To make this more robust to different ways of typing "Developer", you would probably want to use a "lowercase" token filter as well, so the stopword would be "developer" instead.
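The effect of combining the two filters can be imitated in a few lines of Python. This is only an approximation for illustration: `job_analyze` is a hypothetical function, the regular expression stands in for a real tokenizer, and the stopword list ("developer", "dev") is the one this answer uses as its example.

```python
import re

# Example stopwords; extend with anything else that should be ignored.
STOPWORDS = {"developer", "dev"}


def job_analyze(text):
    """Approximate the chain: tokenize, lowercase, then drop stopwords."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in STOPWORDS]


for text in ["PHP Developer", "php dEv", "PHP"]:
    print(repr(text), "->", job_analyze(text))
```

All three inputs reduce to the same single token, which is why a query for "PHP" and an indexed "PHP Developer" end up matching exactly.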

You should note this will require reindexing the data.

The settings file will end up something like this:

    {
      "analysis": {
        "filter": {
          "job_stopwords": {
            "type": "stop",
            "stopwords": ["developer", "dev"]
          }
        },
        "analyzer": {
          "job_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "job_stopwords"]
          }
        }
      }
    }

Then you need to apply the job_analyzer analyser to the job field (job_type in this example) in the mapping for your document.

To have "Developer" increase the score of the hit, you could add a sub-field to the mapping for the field, which uses the default analyser. You could then "must" the job_analyzer version and "should" the default analysed version.

Your mappings would look something like this:

    {
      "job_posting": {
        "properties": {
          "job_type": {
            "type": "string",
            "analyzer": "job_analyzer",
            "fields": {
              "default": {
                "type": "string"
              }
            }
          }
        }
      }
    }

Your query would then be something like this:

    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "job_type": "PHP Developer"
            }
          },
          "should": {
            "match": {
              "job_type.default": "PHP Developer"
            }
          }
        }
      }
    }

This will match "PHP Developer", "php dEv" and "PHP", but "PHP Developer" will get the highest score.
