How to perform an exact match query on an analyzed field in Elasticsearch?


I think this will do what you want (or at least come as close as possible), using the keyword tokenizer and the lowercase token filter:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "lowercase_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": ["lowercase_token_filter"]
            }
         },
         "filter": {
            "lowercase_token_filter": {
               "type": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "lowercase": {
                     "type": "string",
                     "analyzer": "lowercase_analyzer"
                  }
               }
            }
         }
      }
   }
}
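To sanity-check the custom analyzer before indexing anything, you can run some text through the _analyze API (on newer versions it accepts a JSON body; on older 1.x versions you'd pass the analyzer and text as query-string parameters instead). The keyword tokenizer plus the lowercase filter should emit the entire input as a single, lowercased token:

```
GET /test_index/_analyze
{
   "analyzer": "lowercase_analyzer",
   "text": "Super Duper COOL PIzza"
}
```

I'd expect the response to contain one token, "super duper cool pizza", rather than four separate words.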

I added a couple of docs for testing:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

Notice that the outer text_field is analyzed by the standard analyzer; the sub-field raw is not_analyzed (you may not want this one — I added it for comparison); and the sub-field lowercase produces tokens identical to the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field.lowercase": "Super Duper COOL PIzza"
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
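For example (a sketch against the mapping above): a term query is not analyzed at all, so the search string must match the stored token exactly — and against text_field.lowercase that means it must already be entirely lowercase:

```
POST /test_index/_search
{
    "query": {
        "term": {
           "text_field.lowercase": "super duper cool pizza"
        }
    }
}
```

Searching for "Super Duper COOL PIzza" with this term query would return nothing, since the token stored in that field has been lowercased.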

It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):

POST /test_index/_search
{
   "size": 0,
   "aggs": {
      "text_field_standard": {
         "terms": {
            "field": "text_field"
         }
      },
      "text_field_raw": {
         "terms": {
            "field": "text_field.raw"
         }
      },
      "text_field_lowercase": {
         "terms": {
            "field": "text_field.lowercase"
         }
      }
   }
}
...
{
   "took": 26,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "text_field_raw": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 1 },
            { "key": "some other text", "doc_count": 1 },
            { "key": "super duper cool pizza", "doc_count": 1 }
         ]
      },
      "text_field_lowercase": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 1 },
            { "key": "some other text", "doc_count": 1 },
            { "key": "super duper cool pizza", "doc_count": 1 }
         ]
      },
      "text_field_standard": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 2 },
            { "key": "cool", "doc_count": 1 },
            { "key": "duper", "doc_count": 1 },
            { "key": "other", "doc_count": 1 },
            { "key": "some", "doc_count": 1 },
            { "key": "super", "doc_count": 1 },
            { "key": "text", "doc_count": 1 }
         ]
      }
   }
}
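To actually see the raw and lowercase sub-fields diverge, you could index a document that contains uppercase letters — raw would keep the original casing while lowercase would not. A sketch:

```
PUT /test_index/doc/4
{"text_field":"Mixed CASE Pizza"}
```

After this, I'd expect the raw field to hold the token "Mixed CASE Pizza" and the lowercase field to hold "mixed case pizza", so a match query for any casing of "mixed case pizza" would hit the lowercase sub-field but only the original casing would hit raw.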

Here's the code I used to test this out:

http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1

If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch