How to perform an exact match query on an analyzed field in Elasticsearch?


I think this will do what you want (or at least come as close as possible), using the keyword tokenizer and the lowercase token filter:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "lowercase_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": ["lowercase_token_filter"]
            }
         },
         "filter": {
            "lowercase_token_filter": {
               "type": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "lowercase": {
                     "type": "string",
                     "analyzer": "lowercase_analyzer"
                  }
               }
            }
         }
      }
   }
}
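To sanity-check the custom analyzer before indexing anything, you can run some text through the _analyze API (on newer versions it accepts a JSON body; on older 1.x versions you'd pass the analyzer and text as query-string parameters instead). The keyword tokenizer plus the lowercase filter should emit the entire input as a single, lowercased token:

```
GET /test_index/_analyze
{
   "analyzer": "lowercase_analyzer",
   "text": "Super Duper COOL PIzza"
}
```

I'd expect the response to contain one token, "super duper cool pizza", rather than four separate words.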

I added a couple of docs for testing:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

Notice that the outer text_field is analyzed by the standard analyzer; the sub-field raw is not_analyzed (you may not want this one — I added it for comparison); and the sub-field lowercase produces tokens identical to the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field.lowercase": "Super Duper COOL PIzza"
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
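For example (a sketch against the mapping above): a term query is not analyzed at all, so the search string must match the stored token exactly — and against text_field.lowercase that means it must already be entirely lowercase:

```
POST /test_index/_search
{
    "query": {
        "term": {
           "text_field.lowercase": "super duper cool pizza"
        }
    }
}
```

Searching for "Super Duper COOL PIzza" with this term query would return nothing, since the token stored in that field has been lowercased.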

It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):

POST /test_index/_search
{
   "size": 0,
   "aggs": {
      "text_field_standard": {
         "terms": {
            "field": "text_field"
         }
      },
      "text_field_raw": {
         "terms": {
            "field": "text_field.raw"
         }
      },
      "text_field_lowercase": {
         "terms": {
            "field": "text_field.lowercase"
         }
      }
   }
}
...
{
   "took": 26,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "text_field_raw": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 1 },
            { "key": "some other text", "doc_count": 1 },
            { "key": "super duper cool pizza", "doc_count": 1 }
         ]
      },
      "text_field_lowercase": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 1 },
            { "key": "some other text", "doc_count": 1 },
            { "key": "super duper cool pizza", "doc_count": 1 }
         ]
      },
      "text_field_standard": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            { "key": "pizza", "doc_count": 2 },
            { "key": "cool", "doc_count": 1 },
            { "key": "duper", "doc_count": 1 },
            { "key": "other", "doc_count": 1 },
            { "key": "some", "doc_count": 1 },
            { "key": "super", "doc_count": 1 },
            { "key": "text", "doc_count": 1 }
         ]
      }
   }
}
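To actually see the raw and lowercase sub-fields diverge, you could index a document that contains uppercase letters — raw would keep the original casing while lowercase would not. A sketch:

```
PUT /test_index/doc/4
{"text_field":"Mixed CASE Pizza"}
```

After this, I'd expect the raw field to hold the token "Mixed CASE Pizza" and the lowercase field to hold "mixed case pizza", so a match query for any casing of "mixed case pizza" would hit the lowercase sub-field but only the original casing would hit raw.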

Here's the code I used to test this out:

http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1

If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch